Introduction


The focus of this book is non-asymptotic theory in high-dimensional statistics. As an area 
of intellectual inquiry, high-dimensional statistics is not new: it has roots going back to the 
seminal work of Rao, Wigner, Kolmogorov, Huber and others, from the 1950s onwards. 
What is new—and very exciting—is the dramatic surge of interest and activity in high- 
dimensional analysis over the past two decades. The impetus for this research is the nature 
of data sets arising in modern science and engineering: many of them are extremely large, 
often with the dimension of the same order as, or possibly even larger than, the sample 
size. In such regimes, classical asymptotic theory often fails to provide useful predictions, 
and standard methods may break down in dramatic ways. These phenomena call for the 
development of new theory as well as new methods. Developments in high-dimensional 
statistics have connections with many areas of applied mathematics—among them machine 
learning, optimization, numerical analysis, functional and geometric analysis, information 
theory, approximation theory and probability theory. The goal of this book is to provide a 
coherent introduction to this body of work. 


1.1 Classical versus high-dimensional theory 


What is meant by the term “high-dimensional”, and why is it important and interesting 
to study high-dimensional problems? In order to answer these questions, we first need to 
understand the distinction between classical as opposed to high-dimensional theory. 
Classical theory in probability and statistics provides statements that apply to a fixed class 
of models, parameterized by an index n that is allowed to increase. In statistical settings, this 
integer-valued index has an interpretation as a sample size. The canonical instance of such 
a theoretical statement is the law of large numbers. In its simplest instantiation, it concerns 
the limiting behavior of the sample mean of n independent and identically distributed
d-dimensional random vectors $\{X_i\}_{i=1}^n$, say, with mean $\mu = \mathbb{E}[X_1]$ and a finite
variance. The law of large numbers guarantees that the sample mean
$\hat{\mu}_n := \frac{1}{n} \sum_{i=1}^n X_i$ converges in probability to $\mu$. Consequently, the
sample mean $\hat{\mu}_n$ is a consistent estimator of the unknown population mean. A more refined
statement is provided by the central limit theorem, which guarantees that the rescaled deviation
$\sqrt{n}\,(\hat{\mu}_n - \mu)$ converges in distribution to a centered Gaussian with covariance
matrix $\Sigma = \mathrm{cov}(X_1)$. These two theoretical statements underlie the analysis of a
wide range of classical statistical estimators; in particular, they ensure consistency and
asymptotic normality, respectively.


In a classical theoretical framework, the ambient dimension d of the data space is typically
viewed as fixed. In order to appreciate the motivation for high-dimensional statistics, it is
worthwhile considering the following: 


Question Suppose that we are given n = 1000 samples from a statistical model in
d = 500 dimensions. Will theory that requires $n \to +\infty$ with the dimension d remaining
fixed provide useful predictions?


Of course, this question cannot be answered definitively without further details on the 
model under consideration. Some essential facts that motivate our discussion in this book 
are the following: 


1. The data sets arising in many parts of modern science and engineering have a
“high-dimensional flavor”, with d on the same order as, or possibly larger than, the sample
size n.

2. For many of these applications, classical “large n, fixed d” theory fails to provide useful 
predictions. 

3. Classical methods can break down dramatically in high-dimensional regimes. 


These facts motivate the study of high-dimensional statistical models, as well as the associ- 
ated methodology and theory for estimation, testing and inference in such models. 


1.2 What can go wrong in high dimensions? 


In order to appreciate the challenges associated with high-dimensional problems, it is worth- 
while considering some simple problems in which classical results break down. Accordingly, 
this section is devoted to three brief forays into some examples of high-dimensional phenom- 
ena. 


1.2.1 Linear discriminant analysis 


In the problem of binary hypothesis testing, the goal is to determine whether an observed
vector $x \in \mathbb{R}^d$ has been drawn from one of two possible distributions, say
$\mathbb{P}_1$ versus $\mathbb{P}_2$. When these two distributions are known, a natural decision
rule is based on thresholding the log-likelihood ratio
$\log \frac{\mathbb{P}_1[x]}{\mathbb{P}_2[x]}$; varying the setting of the threshold allows for a
principled trade-off between the two types of errors: namely, deciding $\mathbb{P}_1$ when the
true distribution is $\mathbb{P}_2$, and vice versa. The celebrated Neyman–Pearson lemma
guarantees that this family of decision rules, possibly with randomization, is optimal in the
sense that it traces out the curve giving the best possible trade-off between the two error types.

As a special case, suppose that the two classes are distributed as multivariate Gaussians,
say $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$, respectively, differing only in their mean vectors.
In this case, the log-likelihood ratio reduces to the linear statistic

$$\Psi(x) := \left\langle \mu_1 - \mu_2, \; \Sigma^{-1}\left(x - \frac{\mu_1 + \mu_2}{2}\right) \right\rangle, \tag{1.1}$$

where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product in $\mathbb{R}^d$. The
optimal decision rule is based on thresholding this statistic. We can evaluate the quality of
this decision rule by computing the probability of incorrect classification. Concretely, if the
two classes are equally likely, this probability is given by


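$$\mathrm{Err}(\Psi) := \tfrac{1}{2}\,\mathbb{P}_1[\Psi(X') \le 0] + \tfrac{1}{2}\,\mathbb{P}_2[\Psi(X'') > 0],$$

where $X'$ and $X''$ are random vectors drawn from the distributions $\mathbb{P}_1$ and
$\mathbb{P}_2$, respectively. Given our Gaussian assumptions, some algebra shows that the error
probability can be written in terms of the Gaussian cumulative distribution function $\Phi$ as

$$\mathrm{Err}(\Psi) = \underbrace{\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-\gamma/2} e^{-t^2/2}\, dt}_{\Phi(-\gamma/2)}, \quad \text{where } \gamma = \sqrt{(\mu_1 - \mu_2)^{T} \Sigma^{-1} (\mu_1 - \mu_2)}. \tag{1.2}$$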

In practice, the class conditional distributions are not known, but instead one observes
a collection of labeled samples, say $\{x_1, \ldots, x_{n_1}\}$ drawn independently from
$\mathbb{P}_1$, and $\{x_{n_1+1}, \ldots, x_{n_1+n_2}\}$ drawn independently from $\mathbb{P}_2$.
A natural approach is to use these samples in order to estimate the class conditional
distributions, and then “plug” these estimates into the log-likelihood ratio. In the Gaussian
case, estimating the distributions is equivalent to estimating the mean vectors $\mu_1$ and
$\mu_2$, as well as the covariance matrix $\Sigma$, and standard estimates are the sample means

$$\hat{\mu}_1 := \frac{1}{n_1} \sum_{i=1}^{n_1} x_i \quad \text{and} \quad \hat{\mu}_2 := \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} x_i, \tag{1.3a}$$

as well as the pooled sample covariance matrix

$$\hat{\Sigma} := \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (x_i - \hat{\mu}_1)(x_i - \hat{\mu}_1)^{T} + \frac{1}{n_2 - 1} \sum_{i=n_1+1}^{n_1+n_2} (x_i - \hat{\mu}_2)(x_i - \hat{\mu}_2)^{T}. \tag{1.3b}$$


Substituting these estimates into the log-likelihood ratio (1.1) yields the Fisher linear dis- 
criminant function 


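$$\hat{\Psi}(x) = \left\langle \hat{\mu}_1 - \hat{\mu}_2, \; \hat{\Sigma}^{-1}\left(x - \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\right) \right\rangle. \tag{1.4}$$

Here we have assumed that the sample covariance is invertible, and hence are assuming
implicitly that $n_i > d$.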
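To make the plug-in recipe concrete, the following is a minimal sketch of the estimates (1.3a) and (1.3b) and of the discriminant (1.4). The function names, the seed and the problem sizes are illustrative assumptions, as is the per-class normalization $1/(n_j - 1)$ in the pooled covariance; with a zero threshold, any positive rescaling of the covariance estimate leaves the decision unchanged.

```python
import numpy as np

def fit_fisher_lda(x1, x2):
    """Plug-in estimates (1.3a)-(1.3b); rows of x1 and x2 are samples."""
    mu1_hat, mu2_hat = x1.mean(axis=0), x2.mean(axis=0)
    r1, r2 = x1 - mu1_hat, x2 - mu2_hat
    # Assumed normalization: one 1/(n_j - 1) factor per class, then summed.
    sigma_hat = r1.T @ r1 / (len(x1) - 1) + r2.T @ r2 / (len(x2) - 1)
    return mu1_hat, mu2_hat, sigma_hat

def fisher_discriminant(x, mu1_hat, mu2_hat, sigma_hat):
    """Evaluate the statistic (1.4) at each row of x; positive favors class 1."""
    # Solve a linear system instead of forming the inverse explicitly.
    direction = np.linalg.solve(sigma_hat, mu1_hat - mu2_hat)
    return (x - (mu1_hat + mu2_hat) / 2.0) @ direction

rng = np.random.default_rng(0)      # illustrative seed
d, n = 5, 2000                      # illustrative sizes with n much larger than d
mu1 = np.full(d, 0.5)
mu2 = -mu1
params = fit_fisher_lda(rng.standard_normal((n, d)) + mu1,
                        rng.standard_normal((n, d)) + mu2)

# Estimate the error probability with a zero threshold on fresh samples.
test1 = rng.standard_normal((n, d)) + mu1
test2 = rng.standard_normal((n, d)) + mu2
err_hat = (0.5 * np.mean(fisher_discriminant(test1, *params) <= 0)
           + 0.5 * np.mean(fisher_discriminant(test2, *params) > 0))
```

In this low-dimensional configuration the empirical error should sit close to the classical value $\Phi(-\gamma/2)$ with $\gamma = \sqrt{5}$, which is the regime in which the plug-in rule behaves as classical theory predicts.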

Let us assume that the two classes are equally likely a priori. In this case, the error prob- 
ability obtained by using a zero threshold is given by 


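$$\mathrm{Err}(\hat{\Psi}) := \tfrac{1}{2}\,\mathbb{P}_1[\hat{\Psi}(X') \le 0] + \tfrac{1}{2}\,\mathbb{P}_2[\hat{\Psi}(X'') > 0],$$

where $X'$ and $X''$ are samples drawn independently from the distributions $\mathbb{P}_1$ and
$\mathbb{P}_2$, respectively. Note that the error probability is itself a random variable, since
the discriminant function $\hat{\Psi}$ is a function of the samples $\{X_i\}_{i=1}^{n_1+n_2}$.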

In the 1960s, Kolmogorov analyzed a simple version of the Fisher linear discriminant,
in which the covariance matrix $\Sigma$ is known a priori to be the identity, so that the linear
statistic (1.4) simplifies to

$$\hat{\Psi}_{\mathrm{id}}(x) = \left\langle \hat{\mu}_1 - \hat{\mu}_2, \; x - \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} \right\rangle. \tag{1.5}$$

Working under an assumption of Gaussian data, he analyzed the behavior of this method
under a form of high-dimensional asymptotics, in which the triple $(n_1, n_2, d)$ all tend to
infinity, with the ratios $d/n_i$, for i = 1, 2, converging to some non-negative fraction
$\alpha > 0$, and the Euclidean¹ distance $\|\mu_1 - \mu_2\|_2$ converging to a constant
$\gamma > 0$. Under this type of high-dimensional scaling, he showed that the error
$\mathrm{Err}(\hat{\Psi}_{\mathrm{id}})$ converges in probability to a fixed number; in particular,


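$$\mathrm{Err}(\hat{\Psi}_{\mathrm{id}}) \xrightarrow{\text{prob.}} \Phi\left(-\frac{\gamma^2}{2\sqrt{\gamma^2 + 2\alpha}}\right), \tag{1.6}$$

where $\Phi(t) := \mathbb{P}[Z \le t]$ is the cumulative distribution function of a standard normal
variable. Thus, if $d/n_i \to 0$, then the asymptotic error probability is simply
$\Phi(-\gamma/2)$, as is predicted by classical scaling (1.2). However, when the ratios $d/n_i$
converge to a strictly positive number $\alpha > 0$, then the asymptotic error probability is
strictly larger than the classical prediction, since
$\frac{\gamma^2}{2\sqrt{\gamma^2 + 2\alpha}} < \frac{\gamma}{2}$ whenever $\alpha > 0$.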
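The gap between the two predictions can be checked numerically. The sketch below evaluates the classical formula (1.2) and the high-dimensional limit (1.6), and compares them with a small Monte Carlo estimate for the identity-covariance discriminant (1.5). The function names, the seed, the number of trials and the choice of placing the mean shift in a single coordinate are illustrative assumptions; the sample means are drawn directly from their exact Gaussian distribution $N(\mu_j, I_d/n)$ rather than by averaging raw samples.

```python
import math
import numpy as np

def std_normal_cdf(t):
    """Phi(t) = P[Z <= t] for a standard normal Z."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def classical_error(gamma):
    """Classical prediction (1.2)."""
    return std_normal_cdf(-gamma / 2.0)

def high_dim_error(gamma, alpha):
    """High-dimensional prediction (1.6)."""
    return std_normal_cdf(-gamma**2 / (2.0 * math.sqrt(gamma**2 + 2.0 * alpha)))

def simulated_error(gamma, d, n, rng, n_test=2000):
    """One trial of the discriminant (1.5) with n samples per class."""
    mu1 = np.zeros(d)
    mu1[0] = gamma / 2.0            # so that ||mu1 - mu2||_2 = gamma
    mu2 = -mu1
    # The mean of n i.i.d. N(mu, I_d) vectors is exactly N(mu, I_d / n).
    mu1_hat = mu1 + rng.standard_normal(d) / math.sqrt(n)
    mu2_hat = mu2 + rng.standard_normal(d) / math.sqrt(n)
    direction = mu1_hat - mu2_hat
    midpoint = (mu1_hat + mu2_hat) / 2.0
    x1 = rng.standard_normal((n_test, d)) + mu1
    x2 = rng.standard_normal((n_test, d)) + mu2
    return (0.5 * np.mean((x1 - midpoint) @ direction <= 0)
            + 0.5 * np.mean((x2 - midpoint) @ direction > 0))

rng = np.random.default_rng(0)      # illustrative seed
gamma, d, n = 1.5, 400, 800         # alpha = d / n = 0.5
empirical = float(np.mean([simulated_error(gamma, d, n, rng) for _ in range(20)]))
```

With these settings one would expect `empirical` to track `high_dim_error(gamma, d / n)` much more closely than `classical_error(gamma)`.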
Figure 1.1 (a) Plots of the error probability $\mathrm{Err}(\hat{\Psi}_{\mathrm{id}})$ versus the
mean shift parameter $\gamma \in [1, 2]$ for d = 400 and fraction $\alpha = 0.5$, so that
$n_1 = n_2 = 800$. Gray circles correspond to the empirical error probabilities, averaged over
50 trials and confidence bands shown with plus signs, as defined by three times the standard
error. The solid curve gives the high-dimensional prediction (1.6), whereas the dashed curve
gives the classical prediction (1.2). (b) Plots of the error probability
$\mathrm{Err}(\hat{\Psi}_{\mathrm{id}})$ versus the fraction $\alpha \in [0, 1]$ for d = 400 and
$\gamma = 2$. In this case, the classical prediction $\Phi(-\gamma/2)$, plotted as a dashed line,
remains flat, since it is independent of $\alpha$.


Recalling our original motivating question from Section 1.1, it is natural to ask whether
the error probability of the test $\hat{\Psi}_{\mathrm{id}}$, for some finite triple
$(d, n_1, n_2)$, is better described by the classical prediction (1.2), or the high-dimensional
analog (1.6). In Figure 1.1, we plot comparisons between the empirical behavior and theoretical
predictions for different choices of the mean shift parameter $\gamma$ and limiting fraction
$\alpha$. Figure 1.1(a) shows plots of the error probability
$\mathrm{Err}(\hat{\Psi}_{\mathrm{id}})$ versus the mean shift parameter $\gamma$ for dimension
d = 400 and fraction $\alpha = 0.5$, meaning that $n_1 = n_2 = 800$. Gray circles correspond to
the empirical performance averaged over 50 trials, whereas the solid and dashed lines correspond
to the high-dimensional and classical predictions, respectively. Note that the high-dimensional
prediction (1.6) with $\alpha = 0.5$ shows excellent agreement with the behavior in practice,
whereas the classical prediction $\Phi(-\gamma/2)$ drastically underestimates the error rate.
Figure 1.1(b) shows a similar plot, again with dimension d = 400 but with $\gamma = 2$ and the
fraction $\alpha$ ranging in the interval [0.05, 1]. In this case, the classical prediction is
flat, since it has no dependence on $\alpha$. Once again, the empirical behavior shows good
agreement with the high-dimensional prediction.

¹ We note that the Mahalanobis distance from equation (1.2) reduces to the Euclidean distance when $\Sigma = I_d$.

A failure to take into account high-dimensional effects can also lead to sub-optimality. A
simple instance of this phenomenon arises when the two fractions $d/n_i$, i = 1, 2, converge
to possibly different quantities $\alpha_i > 0$ for i = 1, 2. For reasons to become clear
shortly, it is natural to consider the behavior of the discriminant function
$\hat{\Psi}_{\mathrm{id}}$ for a general choice of threshold $t \in \mathbb{R}$, in which case the
associated error probability takes the form

$$\mathrm{Err}_t(\hat{\Psi}_{\mathrm{id}}) := \tfrac{1}{2}\,\mathbb{P}_1[\hat{\Psi}_{\mathrm{id}}(X') \le t] + \tfrac{1}{2}\,\mathbb{P}_2[\hat{\Psi}_{\mathrm{id}}(X'') > t], \tag{1.7}$$

where $X'$ and $X''$ are again independent samples from $\mathbb{P}_1$ and $\mathbb{P}_2$,
respectively. For this set-up, it can be shown that


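$$\mathrm{Err}_t(\hat{\Psi}_{\mathrm{id}}) \xrightarrow{\text{prob.}} \frac{1}{2}\,\Phi\left(-\frac{\gamma^2 + 2t + (\alpha_1 - \alpha_2)}{2\sqrt{\gamma^2 + \alpha_1 + \alpha_2}}\right) + \frac{1}{2}\,\Phi\left(-\frac{\gamma^2 - 2t - (\alpha_1 - \alpha_2)}{2\sqrt{\gamma^2 + \alpha_1 + \alpha_2}}\right),$$

a formula which reduces to the earlier expression (1.6) in the special case when
$\alpha_1 = \alpha_2 = \alpha$ and t = 0. Due to the additional term $\alpha_1 - \alpha_2$, whose
sign differs between the two terms, the choice t = 0 is no longer asymptotically optimal, even
though we have assumed that the two classes are equally likely a priori. Instead, the optimal
choice of the threshold is $t = \frac{\alpha_2 - \alpha_1}{2}$, a
choice that takes into account the different sample sizes between the two classes.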
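The claim about the optimal threshold is easy to verify numerically from the limiting expression. The following sketch evaluates that limit on a grid of thresholds and locates its minimizer; the function names, the grid and the particular values of $(\gamma, \alpha_1, \alpha_2)$ are illustrative assumptions.

```python
import math

def std_normal_cdf(t):
    """Phi(t) = P[Z <= t] for a standard normal Z."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def limiting_error(t, gamma, alpha1, alpha2):
    """Limit of Err_t for the identity-covariance discriminant with threshold t."""
    denom = 2.0 * math.sqrt(gamma**2 + alpha1 + alpha2)
    shift = 2.0 * t + (alpha1 - alpha2)
    return (0.5 * std_normal_cdf(-(gamma**2 + shift) / denom)
            + 0.5 * std_normal_cdf(-(gamma**2 - shift) / denom))

gamma, alpha1, alpha2 = 2.0, 0.2, 1.0          # illustrative unequal aspect ratios
t_star = (alpha2 - alpha1) / 2.0               # threshold claimed to be optimal
grid = [i / 1000.0 for i in range(-2000, 2001)]
t_best = min(grid, key=lambda t: limiting_error(t, gamma, alpha1, alpha2))
```

The grid minimizer `t_best` should coincide with `t_star` up to the grid spacing, and the limiting error at `t_star` should be strictly smaller than at the naive choice t = 0.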


1.2.2 Covariance estimation 


We now turn to an exploration of high-dimensional effects for the problem of covariance
estimation. In concrete terms, suppose that we are given a collection of random vectors
$\{x_1, \ldots, x_n\}$, where each $x_i$ is drawn in an independent and identically distributed
(i.i.d.) manner from some zero-mean distribution in $\mathbb{R}^d$, and our goal is to estimate
the unknown covariance matrix $\Sigma = \mathrm{cov}(X)$. A natural estimator is the sample
covariance matrix

$$\hat{\Sigma} := \frac{1}{n} \sum_{i=1}^{n} x_i x_i^{T}, \tag{1.8}$$

a d × d random matrix corresponding to the sample average of the outer products
$x_i x_i^{T} \in \mathbb{R}^{d \times d}$. By construction, the sample covariance $\hat{\Sigma}$
is an unbiased estimate, meaning that $\mathbb{E}[\hat{\Sigma}] = \Sigma$.

A classical analysis considers the behavior of the sample covariance matrix $\hat{\Sigma}$ as the
sample size n increases while the ambient dimension d stays fixed. There are different ways
in which to measure the distance between the random matrix $\hat{\Sigma}$ and the population
covariance matrix $\Sigma$, but, regardless of which norm is used, the sample covariance is a
consistent estimate. One useful matrix norm is the $\ell_2$-operator norm, given by

$$|\!|\!| \hat{\Sigma} - \Sigma |\!|\!|_2 := \sup_{u \neq 0} \frac{\|(\hat{\Sigma} - \Sigma) u\|_2}{\|u\|_2}. \tag{1.9}$$

Under mild moment conditions, an argument based on the classical law of large numbers
can be used to show that the difference $|\!|\!| \hat{\Sigma} - \Sigma |\!|\!|_2$ converges to zero
almost surely as $n \to \infty$. Consequently, the sample covariance is a strongly consistent
estimate of the population covariance in the classical setting.

Is this type of consistency preserved if we also allow the dimension d to tend to infinity?
In order to pose the question more crisply, let us consider sequences of problems
$(\hat{\Sigma}, \Sigma)$ indexed by the pair (n, d), and suppose that we allow both n and d to
increase with their ratio remaining fixed; in particular, say $d/n = \alpha \in (0, 1)$. In
Figure 1.2, we plot the results of simulations for a random ensemble with $\Sigma = I_d$, with
each $X_i \sim N(0, I_d)$ for i = 1, ..., n. Using these n samples, we generated the sample
covariance matrix (1.8), and then computed its vector of eigenvalues
$\gamma(\hat{\Sigma}) \in \mathbb{R}^d$, say arranged in non-increasing order as

$$\gamma_{\max}(\hat{\Sigma}) = \gamma_1(\hat{\Sigma}) \ge \gamma_2(\hat{\Sigma}) \ge \cdots \ge \gamma_d(\hat{\Sigma}) = \gamma_{\min}(\hat{\Sigma}) \ge 0.$$

Each plot shows a histogram of the vector $\gamma(\hat{\Sigma}) \in \mathbb{R}^d$ of eigenvalues:
Figure 1.2(a) corresponds to the case (n, d) = (4000, 800) or $\alpha = 0.2$, whereas
Figure 1.2(b) shows the pair (n, d) = (4000, 2000) or $\alpha = 0.5$. If the sample covariance
matrix were converging to the identity matrix, then the vector of eigenvalues
$\gamma(\hat{\Sigma})$ should converge to the all-ones vector, and the corresponding histograms
should concentrate around 1. Instead, the histograms in both plots are highly dispersed around 1,
with differing shapes depending on the aspect ratios.


Figure 1.2 Empirical distribution of the eigenvalues of a sample covariance matrix
$\hat{\Sigma}$ versus the asymptotic prediction of the Marčenko–Pastur (MP) law; the two panels
are titled “Empirical vs MP law” for $\alpha = 0.2$ and $\alpha = 0.5$. The law is specified by a
density of the form
$f_{\mathrm{MP}}(\gamma) \propto \frac{\sqrt{(t_{\max}(\alpha) - \gamma)(\gamma - t_{\min}(\alpha))}}{\gamma}$,
supported on the interval
$[t_{\min}(\alpha), t_{\max}(\alpha)] = [(1 - \sqrt{\alpha})^2, (1 + \sqrt{\alpha})^2]$.
(a) Aspect ratio $\alpha = 0.2$ and (n, d) = (4000, 800). (b) Aspect ratio $\alpha = 0.5$ and
(n, d) = (4000, 2000). In both cases, the maximum eigenvalue $\gamma_{\max}(\hat{\Sigma})$ is very
close to $(1 + \sqrt{\alpha})^2$, consistent with theory.


These shapes, if we let both the sample size and dimension increase in such a way that
$d/n \to \alpha \in (0, 1)$, are characterized by an asymptotic distribution known as the
Marčenko–Pastur law. Under some mild moment conditions, this theory predicts convergence to a
strictly positive density supported on the interval $[t_{\min}(\alpha), t_{\max}(\alpha)]$, where

$$t_{\min}(\alpha) := (1 - \sqrt{\alpha})^2 \quad \text{and} \quad t_{\max}(\alpha) := (1 + \sqrt{\alpha})^2. \tag{1.10}$$


See the caption of Figure 1.2 for more details. 
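As a rough numerical companion to this discussion, the sketch below draws standard Gaussian data, computes the eigenvalues of the sample covariance matrix, and compares their range with the support edges (1.10). The normalizing constant $1/(2\pi\alpha)$ in the density is the standard one for identity covariance and is an addition here, since the caption states the density only up to proportionality; the function names, the seed and the reduced problem size (n, d) = (1000, 200) are likewise illustrative assumptions.

```python
import numpy as np

def mp_support(alpha):
    """Support edges (1.10) for aspect ratio alpha in (0, 1)."""
    return (1.0 - np.sqrt(alpha)) ** 2, (1.0 + np.sqrt(alpha)) ** 2

def mp_density(gamma, alpha):
    """Marchenko-Pastur density for identity covariance (assumed normalization)."""
    t_min, t_max = mp_support(alpha)
    gamma = np.atleast_1d(np.asarray(gamma, dtype=float))
    density = np.zeros_like(gamma)
    inside = (gamma > t_min) & (gamma < t_max)
    g = gamma[inside]
    density[inside] = np.sqrt((t_max - g) * (g - t_min)) / (2.0 * np.pi * alpha * g)
    return density

rng = np.random.default_rng(0)      # illustrative seed
n, d = 1000, 200                    # aspect ratio alpha = 0.2
alpha = d / n
x = rng.standard_normal((n, d))
eigenvalues = np.linalg.eigvalsh(x.T @ x / n)   # spectrum of the sample covariance
t_min, t_max = mp_support(alpha)

# Riemann-sum check that the density integrates to (approximately) one.
grid = np.linspace(t_min, t_max, 20001)
total_mass = float(np.sum(mp_density(grid, alpha)) * (grid[1] - grid[0]))
```

A histogram of `eigenvalues` overlaid with `mp_density` on `grid` would give a small-scale analog of the panels in Figure 1.2.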

The Marčenko–Pastur law is an asymptotic statement, albeit of a non-classical flavor since
it allows both the sample size and dimension to diverge. By contrast, the primary focus of
this book is on results that are non-asymptotic in nature; that is, in the current context, we
seek results that hold for all choices of the pair (n, d), and that provide explicit bounds on
the events of interest. For example, as we discuss at more length in Chapter 6, in the setting
of Figure 1.2, it can be shown that the maximum eigenvalue $\gamma_{\max}(\hat{\Sigma})$ satisfies
the upper deviation inequality

$$\mathbb{P}\left[\gamma_{\max}(\hat{\Sigma}) \ge \left(1 + \sqrt{d/n} + \delta\right)^2\right] \le e^{-n\delta^2/2} \quad \text{for all } \delta \ge 0, \tag{1.11}$$

with an analogous lower deviation inequality for the minimum eigenvalue
$\gamma_{\min}(\hat{\Sigma})$ in the regime n > d. This result gives us more refined information
about the maximum eigenvalue, showing that the probability that it deviates above
$(1 + \sqrt{d/n})^2$ is exponentially small in the sample size n. In addition, this inequality
(and related results) can be used to show that the sample covariance matrix $\hat{\Sigma}$ is an
operator-norm-consistent estimate of the population covariance matrix $\Sigma$ as long as
$d/n \to 0$.


1.2.3 Nonparametric regression 


The effects of high dimensions on regression problems can be even more dramatic. In one
instance of the problem known as nonparametric regression, we are interested in estimating
a function from the unit hypercube $[0, 1]^d$ to the real line $\mathbb{R}$; this function can be
viewed as mapping a vector $x \in [0, 1]^d$ of predictors or covariates to a scalar response
variable $y \in \mathbb{R}$. If we view the pair (X, Y) as random variables, then we can ask for
the function f that minimizes the least-squares prediction error $\mathbb{E}[(Y - f(X))^2]$. An
easy calculation shows that the optimal such function is defined by the conditional expectation
$f(x) = \mathbb{E}[Y \mid x]$, and it is
known as the regression function.
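For completeness, here is one way to carry out the calculation, sketched under the assumption that $Y$ and the competing predictor $g(X)$ are both square-integrable (the symbol $g$ is introduced only for this illustration). Writing $f(x) = \mathbb{E}[Y \mid X = x]$ and expanding the square around $f(X)$ gives:

```latex
\begin{align*}
\mathbb{E}\big[(Y - g(X))^2\big]
  &= \mathbb{E}\big[(Y - f(X))^2\big]
     + 2\,\mathbb{E}\big[(Y - f(X))\,(f(X) - g(X))\big]
     + \mathbb{E}\big[(f(X) - g(X))^2\big] \\
  &= \mathbb{E}\big[(Y - f(X))^2\big] + \mathbb{E}\big[(f(X) - g(X))^2\big]
  \;\ge\; \mathbb{E}\big[(Y - f(X))^2\big].
\end{align*}
% The cross term vanishes by conditioning on X (tower property):
% E[(Y - f(X)) h(X)] = E[ h(X) E[Y - f(X) | X] ] = 0, applied with h = f - g.
```

Equality holds only when $g(X) = f(X)$ almost surely, so the conditional expectation is the essentially unique minimizer.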

In practice, the joint distribution $\mathbb{P}_{X,Y}$ of (X, Y) is unknown, so that computing f
directly is not possible. Instead, we are given samples $(X_i, Y_i)$ for i = 1, ..., n, drawn in
an i.i.d. manner from $\mathbb{P}_{X,Y}$, and our goal is to find a function $\hat{f}$ for which
the mean-squared error (MSE)

$$\|\hat{f} - f\|_{L^2}^2 := \mathbb{E}_X\big[(\hat{f}(X) - f(X))^2\big] \tag{1.12}$$

is as small as possible.

It turns out that this problem becomes extremely difficult in high dimensions, a manifestation
of what is known as the curse of dimensionality. This notion will be made precise
in our discussion of nonparametric regression in Chapter 13. Here, let us do some simple
simulations to address the following question: How many samples n should be required as
a function of the problem dimension d? For concreteness, let us suppose that the covariate
vector X is uniformly distributed over $[0, 1]^d$, so that $\mathbb{P}_X$ is the uniform
distribution, denoted by $\mathrm{Uni}([0, 1]^d)$. If we are able to generate a good estimate of f
based on the samples $X_1, \ldots, X_n$, then it should be the case that a typical vector
$X' \in [0, 1]^d$ is relatively close to at least one of our samples. To formalize this notion,
we might study the quantity

$$\rho_\infty(n, d) := \mathbb{E}_{X', X}\Big[\min_{i = 1, \ldots, n} \|X' - X_i\|_\infty\Big], \tag{1.13}$$

which measures the average distance between an independently drawn sample $X'$, again
from the uniform distribution $\mathrm{Uni}([0, 1]^d)$, and our original data set
$\{X_1, \ldots, X_n\}$.

How many samples n do we need to collect as a function of the dimension d so as to ensure
that $\rho_\infty(n, d)$ falls below some threshold $\delta$? For illustrative purposes, we use
$\delta = 1/3$ in the simulations to follow. As in the previous sections, let us first consider a
scaling in which the ratio d/n converges to some constant $\alpha > 0$, say $\alpha = 0.5$ for
concreteness, so that n = 2d. Figure 1.3(a) shows the results of estimating the quantity
$\rho_\infty(2d, d)$ on the basis of 20 trials. As shown by the gray circles, in practice, the
average distance to the closest point of a data set based on n = 2d samples tends to increase
with dimension, and certainly stays bounded above 1/3. What happens if we try a more aggressive
scaling of the sample size? Figure 1.3(b) shows the results of the same experiments with
$n = d^2$ samples; again, the minimum distance tends
to increase as the dimension increases, and stays bounded well above 1/3.
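A simulation of this kind takes only a few lines. The sketch below estimates the quantity (1.13) by Monte Carlo for the two scalings n = 2d and $n = d^2$; the function name, the seed and the particular dimensions are illustrative assumptions and are not the settings used for Figure 1.3.

```python
import numpy as np

def estimate_rho_inf(n, d, rng, n_trials=20):
    """Monte Carlo estimate of rho_inf(n, d) from (1.13)."""
    values = []
    for _ in range(n_trials):
        data = rng.uniform(size=(n, d))      # data set X_1, ..., X_n
        query = rng.uniform(size=d)          # independent draw X'
        # l_inf distance from the query to each sample, then the minimum.
        values.append(np.max(np.abs(data - query), axis=1).min())
    return float(np.mean(values))

rng = np.random.default_rng(0)               # illustrative seed
dims = [20, 40, 80]                          # illustrative dimensions
rho_linear = [estimate_rho_inf(2 * d, d, rng) for d in dims]
rho_quadratic = [estimate_rho_inf(d * d, d, rng) for d in dims]
```

For these dimensions, both lists should increase with d and stay above the threshold 1/3, in line with the behavior described for the two panels.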
Figure 1.3 Behavior of the quantity $\rho_\infty(n, d)$ versus the dimension d, for different
scalings of the pair (n, d). Full circles correspond to the average over 20 trials, with
confidence bands shown with plus signs, whereas the solid curve provides the theoretical
lower bound (1.14). (a) Behavior of the variable $\rho_\infty(2d, d)$. (b) Behavior of the
variable $\rho_\infty(d^2, d)$. In both cases, the expected minimum distance remains bounded
above 1/3, corresponding to $\log(1/3) \approx -1.1$ (horizontal dashed line) on this
logarithmic scale.


In fact, we would need to take an exponentially large sample size in order to ensure that
$\rho_\infty(n, d)$ remained below $\delta$ as the dimension increased. This fact can be confirmed
by proving the lower bound

$$\log \rho_\infty(n, d) \ge \log \frac{d}{2(d + 1)} - \frac{\log n}{d}, \tag{1.14}$$

which implies that a sample size $n \ge \left(\frac{d}{2(d+1)\,\delta}\right)^d$, growing
exponentially in d like $(1/\delta)^d$ up to a constant factor in the base, is required to ensure
that the upper bound $\rho_\infty(n, d) \le \delta$ holds. We leave the proof of the bound (1.14)
as an exercise for the reader.
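For readers who would like a hint, one possible route to the bound (1.14) is sketched below; it is a proof outline via a volume bound, a union bound and the tail-integral formula for expectations, and other arguments are certainly possible. The cutoff $t_0$ is a quantity introduced only for this sketch.

```latex
\begin{align*}
&\text{Volume and union bounds: for any fixed } x' \text{ and } t \ge 0, \\
&\qquad \mathbb{P}\big[\|x' - X_i\|_\infty \le t\big] \le (2t)^d
   \quad\Longrightarrow\quad
   \mathbb{P}\Big[\min_{i=1,\ldots,n} \|X' - X_i\|_\infty \le t\Big] \le n\,(2t)^d. \\[4pt]
&\text{Tail integration with } t_0 := \tfrac{1}{2}\, n^{-1/d}, \text{ so that } n\,(2 t_0)^d = 1: \\
&\qquad \rho_\infty(n, d)
   = \int_0^\infty \mathbb{P}\Big[\min_{i} \|X' - X_i\|_\infty > t\Big]\, dt
   \;\ge\; \int_0^{t_0} \big(1 - n\,(2t)^d\big)\, dt
   = t_0 - \frac{n\, 2^d\, t_0^{d+1}}{d + 1}
   = \frac{d}{d + 1}\, t_0. \\[4pt]
&\text{Hence } \rho_\infty(n, d) \ge \frac{d}{2(d+1)}\, n^{-1/d},
   \text{ and taking logarithms gives (1.14).}
\end{align*}
```

Rearranging the final inequality shows that $\rho_\infty(n, d) \le \delta$ forces $n \ge \big(\tfrac{d}{2(d+1)\delta}\big)^d$.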
We have chosen to illustrate this exponential explosion in a randomized setting, where
the covariates X are drawn uniformly from the hypercube $[0, 1]^d$. But the curse of
dimensionality manifests itself with equal ferocity in the deterministic setting, where we are
given the freedom of choosing some collection $\{x_i\}_{i=1}^n$ of vectors in the hypercube
$[0, 1]^d$. Let us investigate the minimal number n required to ensure that any vector
$x' \in [0, 1]^d$ is at most distance $\delta$ in the $\ell_\infty$-norm to some vector in our
collection; that is, such that

$$\sup_{x' \in [0, 1]^d} \; \min_{i = 1, \ldots, n} \|x' - x_i\|_\infty \le \delta. \tag{1.15}$$


The most straightforward way of ensuring this approximation quality is by a uniform gridding
of the unit hypercube: in particular, suppose that we divide each of the d sides of the
cube into $\lceil 1/(2\delta) \rceil$ sub-intervals,² each of length at most $2\delta$. Taking the
Cartesian products of these sub-intervals yields a total of $\lceil 1/(2\delta) \rceil^d$ boxes.
Placing one of our points $x_i$ at the center of
each of these boxes yields the desired approximation (1.15).
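The gridding construction can be written out directly. The following sketch builds the box centers for a given $\delta$ and d, and checks the covering property (1.15) on random query points; the function names, the seed and the choice d = 4 are illustrative assumptions, and only the standard library is used.

```python
import itertools
import math
import random

def grid_covering(delta, d):
    """Centers of the ceil(1/(2*delta))**d boxes of a uniform grid on [0, 1]^d."""
    m = math.ceil(1.0 / (2.0 * delta))     # sub-intervals per side
    width = 1.0 / m                        # each has length at most 2*delta
    centers = [(k + 0.5) * width for k in range(m)]
    return list(itertools.product(centers, repeat=d))

def linf_distance_to_set(x, points):
    """Smallest l_inf distance from x to any point of the collection."""
    return min(max(abs(a - b) for a, b in zip(x, p)) for p in points)

delta, d = 1.0 / 3.0, 4                    # illustrative choices
cover = grid_covering(delta, d)
random.seed(0)                             # illustrative seed
worst = max(
    linf_distance_to_set([random.random() for _ in range(d)], cover)
    for _ in range(500)
)
# Size of the grid as the dimension grows: exponential in d.
sizes = {dim: math.ceil(1.0 / (2.0 * delta)) ** dim for dim in (1, 5, 10, 20)}
```

Even for the generous tolerance $\delta = 1/3$, the grid already has $2^{20}$, roughly a million, points in dimension 20.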

This construction provides an instance of what is known as a $\delta$-covering of the unit
hypercube in the $\ell_\infty$-norm, and we see that its size must grow exponentially in the
dimension. By studying a related quantity known as a $\delta$-packing, this exponential scaling
can be shown to be inescapable; that is, there is not a covering set with substantially fewer
elements. See Chapter 5 for a much more detailed treatment of the notions of packing and covering.


1.3 What can help us in high dimensions? 


An important fact is that the high-dimensional phenomena described in the previous sections 
are all unavoidable. Concretely, for the classification problem described in Section 1.2.1, if 
the ratio d/n stays bounded strictly above zero, then it is not possible to achieve the optimal 
classification rate (1.2). For the covariance estimation problem described in Section 1.2.2, 
there is no consistent estimator of the covariance matrix in $\ell_2$-operator norm when d/n
remains bounded away from zero. Finally, for the nonparametric regression problem in
Section 1.2.3, given the goal of estimating a differentiable regression function f, no consistent
procedure is possible unless the sample size n grows exponentially in the dimension d. All 
of these statements can be made rigorous via the notions of metric entropy and minimax 
lower bounds, to be developed in Chapters 5 and 15, respectively. 

Given these “no free lunch” guarantees, what can help us in the high-dimensional setting? 
Essentially, our only hope is that the data is endowed with some form of low-dimensional 
structure, one which makes it simpler than the high-dimensional view might suggest. Much 
of high-dimensional statistics involves constructing models of high-dimensional phenomena 
that involve some implicit form of low-dimensional structure, and then studying the statisti- 
cal and computational gains afforded by exploiting this structure. In order to illustrate, let us 
revisit our earlier three vignettes, and show how the behavior can change dramatically when 
low-dimensional structure is present. 


² Here $\lceil a \rceil$ denotes the ceiling of a, that is, the smallest integer greater than or equal to a.


1.3.1 Sparsity in vectors


Recall the simple classification problem described in Section 1.2.1, in which, for j = 1, 2,
we observe $n_j$ samples of a multivariate Gaussian with mean $\mu_j \in \mathbb{R}^d$ and
identity covariance matrix $I_d$. Setting $n = n_1 = n_2$, let us recall the scaling in which the
ratios $d/n_j$ are fixed to some number $\alpha \in (0, \infty)$. What is the underlying cause of
the inaccuracy of the classical prediction shown in Figure 1.1? Recalling that $\hat{\mu}_j$
denotes the sample mean of the $n_j$ samples, the squared Euclidean error
$\|\hat{\mu}_j - \mu_j\|_2^2$ turns out to concentrate sharply around $\frac{d}{n_j} = \alpha$.
This fact is a straightforward consequence of the chi-squared ($\chi^2$) tail bounds to be
developed in Chapter 2; in particular, see Example 2.11. When $\alpha > 0$, there is a constant
level of error,
for which reason the classical prediction (1.2) of the error rate is overly optimistic.
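This concentration is easy to observe empirically, since $n \|\hat{\mu}_j - \mu_j\|_2^2$ follows a $\chi^2$ distribution with d degrees of freedom when the data are Gaussian with identity covariance. The sketch below repeats the experiment for d = 400 and n = 800; the function name, the seed and the number of repetitions are illustrative assumptions.

```python
import numpy as np

def sample_mean_sq_error(d, n, rng):
    """Squared Euclidean error of the sample mean of n draws from N(mu, I_d)."""
    mu = np.zeros(d)   # the true mean; its value does not affect the error
    x = rng.standard_normal((n, d)) + mu
    return float(np.sum((x.mean(axis=0) - mu) ** 2))

rng = np.random.default_rng(0)   # illustrative seed
d, n = 400, 800                  # aspect ratio alpha = d / n = 0.5
errors = np.array([sample_mean_sq_error(d, n, rng) for _ in range(50)])
```

The values in `errors` should cluster tightly around $\alpha = 0.5$, with fluctuations of order $\sqrt{2d}/n \approx 0.035$.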

But the sample mean is not the only possible estimate of the true mean: when the true
mean vector is equipped with some type of low-dimensional structure, there can be much
better estimators. Perhaps the simplest form of structure is sparsity: suppose that we knew
that each mean vector $\mu_j$ were relatively sparse, with only s of its d entries being
non-zero, for some sparsity parameter $s \ll d$. In this case, we can obtain a substantially
better estimator by applying some form of thresholding to the sample means. As an example, for a
given threshold level $\lambda > 0$, the hard-thresholding estimator is given by

$$H_\lambda(x) = x\,\mathbb{I}[|x| > \lambda] = \begin{cases} x & \text{if } |x| > \lambda, \\ 0 & \text{otherwise,} \end{cases} \tag{1.16}$$

where $\mathbb{I}[|x| > \lambda]$ is a 0-1 indicator for the event $\{|x| > \lambda\}$. As shown
by the solid curve in Figure 1.4(a), it is a “keep-or-kill” function that zeroes out x whenever
its absolute value falls below the threshold $\lambda$, and does nothing otherwise. A closely
related function is the soft-thresholding operator

$$T_\lambda(x) = \mathbb{I}[|x| > \lambda]\,\big(x - \lambda\,\mathrm{sign}(x)\big) = \begin{cases} x - \lambda\,\mathrm{sign}(x) & \text{if } |x| > \lambda, \\ 0 & \text{otherwise.} \end{cases} \tag{1.17}$$


As shown by the dashed line in Figure 1.4(a), it has been shifted so as to be continuous, in 
contrast to the hard-thresholding function. 
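Both operators act coordinate-wise and can be written in one line each. The following is a small sketch of (1.16) and (1.17) using NumPy; the function names and the test inputs are illustrative choices.

```python
import numpy as np

def hard_threshold(x, lam):
    """Hard-thresholding (1.16): keep entries with |x| > lam, zero out the rest."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > lam, x, 0.0)

def soft_threshold(x, lam):
    """Soft-thresholding (1.17): additionally shrink surviving entries by lam."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > lam, x - lam * np.sign(x), 0.0)

values = [-2.0, -0.5, 0.0, 0.5, 2.0]
hard = hard_threshold(values, 1.0)
soft = soft_threshold(values, 1.0)
```

Evaluating both functions just above the threshold makes the difference visible: the hard-thresholded value jumps from 0 to roughly $\lambda$, whereas the soft-thresholded value starts from 0 and grows continuously.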

In the context of our classification problem, instead of using the sample means
$\hat{\mu}_j$ in the plug-in classification rule (1.5), suppose that we used hard-thresholded
versions of the sample means, namely

$$\tilde{\mu}_j = H_{\lambda_n}(\hat{\mu}_j) \quad \text{for } j = 1, 2, \quad \text{where } \lambda_n := \sqrt{\frac{2 \log d}{n}}. \tag{1.18}$$

Standard tail bounds to be developed in Chapter 2 (see Exercise 2.12 in particular) will
illuminate why this particular choice of threshold $\lambda_n$ is a good one. Using these
thresholded estimates, we can then implement a classifier based on the linear discriminant

$$\tilde{\Psi}(x) := \left\langle \tilde{\mu}_1 - \tilde{\mu}_2, \; x - \frac{\tilde{\mu}_1 + \tilde{\mu}_2}{2} \right\rangle. \tag{1.19}$$


In order to explore the performance of this classifier, we performed simulations using the
same parameters as those in Figure 1.1(a); Figure 1.4(b) gives a plot of the error
$\mathrm{Err}(\tilde{\Psi})$ versus the mean shift $\gamma$. Overlaid for comparison are both the
classical (1.2) and high-dimensional (1.6) predictions. In contrast to Figure 1.1(a), the
classical prediction now gives an excellent fit to the observed behavior. In fact, the classical
limit prediction is exact whenever the ratio $\log \binom{d}{s} / n$ approaches zero. Our theory
on sparse vector estimation in Chapter 7 can be used to provide a rigorous justification of this
claim.

Figure 1.4 (a) Plots of the hard-thresholding and soft-thresholding functions at some
level $\lambda > 0$. (b) Plots of the error probability $\mathrm{Err}(\tilde{\Psi})$ versus the
mean shift parameter $\gamma \in [1, 2]$ with the same set-up as the simulations in Figure 1.1:
dimension d = 400, and sample sizes $n = n_1 = n_2 = 800$. In this case, the mean vectors
$\mu_1$ and $\mu_2$ each had s = 5 non-zero entries, and the classification was based on
hard-thresholded versions of the sample means at the level
$\lambda_n = \sqrt{\frac{2 \log d}{n}}$. Gray circles correspond to the empirical error
probabilities, averaged over 50 trials and confidence intervals defined by three times the
standard error. The solid curve gives the high-dimensional prediction (1.6), whereas the dashed
curve gives the classical prediction (1.2). In contrast to Figure 1.1(a), the classical
prediction is now accurate.
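The effect of thresholding can be reproduced in a few lines. The sketch below runs a small Monte Carlo experiment for the classifier (1.19) with hard-thresholded means at level $\lambda_n$ from (1.18), using d = 400, n = 800 and s = 5. The function names, the seed, the number of trials and the particular sparse means (equal entries on the first s coordinates, with $\mu_2 = -\mu_1$) are illustrative assumptions; as before, the sample means are drawn directly from $N(\mu_j, I_d/n)$.

```python
import math
import numpy as np

def hard_threshold(v, lam):
    """Coordinate-wise hard-thresholding, as in (1.16)."""
    return np.where(np.abs(v) > lam, v, 0.0)

def std_normal_cdf(t):
    """Phi(t) = P[Z <= t] for a standard normal Z."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def thresholded_error(gamma, d, n, s, rng, n_test=2000):
    """One trial of the classifier (1.19) built from thresholded sample means."""
    mu1 = np.zeros(d)
    mu1[:s] = gamma / (2.0 * math.sqrt(s))   # s non-zeros, ||mu1 - mu2||_2 = gamma
    mu2 = -mu1
    lam = math.sqrt(2.0 * math.log(d) / n)   # threshold level from (1.18)
    mu1_tilde = hard_threshold(mu1 + rng.standard_normal(d) / math.sqrt(n), lam)
    mu2_tilde = hard_threshold(mu2 + rng.standard_normal(d) / math.sqrt(n), lam)
    direction = mu1_tilde - mu2_tilde
    midpoint = (mu1_tilde + mu2_tilde) / 2.0
    x1 = rng.standard_normal((n_test, d)) + mu1
    x2 = rng.standard_normal((n_test, d)) + mu2
    return (0.5 * np.mean((x1 - midpoint) @ direction <= 0)
            + 0.5 * np.mean((x2 - midpoint) @ direction > 0))

rng = np.random.default_rng(0)               # illustrative seed
gamma, d, n, s = 1.5, 400, 800, 5
empirical = float(np.mean([thresholded_error(gamma, d, n, s, rng) for _ in range(20)]))
classical = std_normal_cdf(-gamma / 2.0)
high_dim = std_normal_cdf(-gamma**2 / (2.0 * math.sqrt(gamma**2 + 2.0 * d / n)))
```

Here `empirical` should land near `classical` and clearly below `high_dim`, mirroring the change between Figure 1.1(a) and Figure 1.4(b).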


1.3.2 Structure in covariance matrices 


In Section 1.2.2, we analyzed the behavior of the eigenvalues of a sample covariance matrix
$\hat{\Sigma}$ based on n samples of a d-dimensional random vector with the identity matrix as its
covariance. As shown in Figure 1.2, when the ratio d/n remains bounded away from zero, the
sample eigenspectrum $\gamma(\hat{\Sigma})$ remains highly dispersed around 1, showing that
$\hat{\Sigma}$ is not a good estimate of the population covariance matrix $\Sigma = I_d$. Again,
we can ask the questions: What types of low-dimensional structure might be appropriate for
modeling covariance matrices? And how can they be exploited to construct better estimators?

As a very simple example, suppose that our goal is to estimate a covariance matrix known
to be diagonal. It is then intuitively clear that the sample covariance matrix can be improved
by zeroing out its non-diagonal entries, leading to the diagonal covariance estimate
$\hat{D}$. A little more realistically, if the covariance matrix $\Sigma$ were assumed to be
sparse but the positions of its non-zero entries were unknown, then a reasonable estimator would
be the hard-thresholded version $\tilde{\Sigma} := T_{\lambda_n}(\hat{\Sigma})$ of the sample
covariance, say with $\lambda_n = \sqrt{\frac{2 \log d}{n}}$ as before. Figure 1.5(a)
shows the resulting eigenspectrum $\gamma(\tilde{\Sigma})$ of this estimator with aspect ratio
$\alpha = 0.2$ and (n, d) = (4000, 800); that is, the same settings as Figure 1.2(a). In contrast
to the Marčenko–Pastur behavior shown in the former figure, we now see that the eigenspectrum
$\gamma(\tilde{\Sigma})$ is sharply concentrated around the point mass at 1. Tail bounds and
theory from Chapters 2 and 6 can be used to show that
$|\!|\!| \tilde{\Sigma} - \Sigma |\!|\!|_2 \lesssim \sqrt{\frac{\log d}{n}}$ with high probability.
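The benefit of thresholding can be seen in a small experiment. The sketch below compares the operator-norm error of the sample covariance with that of its entrywise hard-thresholded version when $\Sigma = I_d$; the reduced problem size (n, d) = (2000, 400), which keeps the aspect ratio at 0.2, the seed and the function name are illustrative assumptions.

```python
import numpy as np

def hard_threshold(a, lam):
    """Entrywise hard-thresholding of a matrix at level lam."""
    return np.where(np.abs(a) > lam, a, 0.0)

rng = np.random.default_rng(0)            # illustrative seed
n, d = 2000, 400                          # aspect ratio d / n = 0.2
x = rng.standard_normal((n, d))           # rows are N(0, I_d) samples
sample_cov = x.T @ x / n                  # sample covariance (1.8)
lam = np.sqrt(2.0 * np.log(d) / n)        # same threshold level as in (1.18)
thresh_cov = hard_threshold(sample_cov, lam)

# Operator (spectral) norm errors relative to the true covariance I_d.
op_err_sample = np.linalg.norm(sample_cov - np.eye(d), 2)
op_err_thresh = np.linalg.norm(thresh_cov - np.eye(d), 2)
```

In this setting `op_err_sample` should be of order $2\sqrt{\alpha} + \alpha \approx 1.1$, as suggested by the Marčenko–Pastur edge, while `op_err_thresh` should be several times smaller.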
Figure 1.5 (a) Behavior of the eigenspectrum $\gamma(\tilde{\Sigma})$ for a hard-thresholded
version of the sample covariance matrix. Unlike the sample covariance matrix itself, it can
be a consistent estimator of a sparse covariance matrix even for scalings such that
$d/n = \alpha > 0$. (b) Behavior of the sample covariance matrix for estimating sequences
of covariance matrices of increasing dimension but all satisfying the constraint
$\mathrm{trace}(\Sigma) \le 20$. Consistent with theoretical predictions, the operator norm error
$|\!|\!| \hat{\Sigma} - \Sigma |\!|\!|_2$ for this sequence decays at the rate $1/\sqrt{n}$, as
shown by the solid line on the log–log plot.


An alternative form of low-dimensional structure for symmetric matrices is that of fast
decay in their eigenspectra. If we again consider sequences of problems indexed by $(n, d)$,
suppose that our sequence of covariance matrices has a bounded trace, that is,
$\operatorname{trace}(\Sigma) \leq R$, independent of the dimension $d$. This requirement means that the ordered
eigenvalues $\gamma_j(\Sigma)$ must decay a little more quickly than $j^{-1}$. As we discuss in Chapter 10,
these types of eigendecay conditions hold in a variety of applications. Figure 1.5(b) shows a
log-log plot of the operator norm error $|||\widehat{\Sigma} - \Sigma|||_2$ over a range of pairs $(n, d)$, all with the
fixed ratio $d/n = 0.2$, for a sequence of covariance matrices that all satisfy the constraint
$\operatorname{trace}(\Sigma) \leq 20$. Theoretical results to be developed in Chapter 6 predict that, for such a
sequence of covariance matrices, the error $|||\widehat{\Sigma} - \Sigma|||_2$ should decay as $n^{-1/2}$, even if the
dimension $d$ grows in proportion to the sample size $n$. See also Chapters 8 and 10 for
discussion of other forms of matrix estimation in which these types of rank or eigendecay
constraints play a role.
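To make the eigendecay requirement concrete, the following sketch compares the trace of two hypothetical spectra as the dimension grows. The power-law eigenvalues $\gamma_j = j^{-1}$ and $\gamma_j = j^{-3/2}$, as well as the function name `power_law_trace`, are illustrative assumptions and are not taken from the text.

```python
import math


def power_law_trace(d: int, exponent: float) -> float:
    """Trace of a d-dimensional covariance matrix whose ordered eigenvalues
    are gamma_j = j ** (-exponent); this spectrum is an illustrative assumption."""
    return math.fsum(j ** (-exponent) for j in range(1, d + 1))


dims = [10, 1_000, 100_000]
slow_decay = [power_law_trace(d, 1.0) for d in dims]  # gamma_j = 1/j
fast_decay = [power_law_trace(d, 1.5) for d in dims]  # gamma_j = j^(-3/2)

for d, slow, fast in zip(dims, slow_decay, fast_decay):
    print(f"d = {d:>7}: trace with 1/j decay = {slow:7.3f}, "
          f"trace with j^(-3/2) decay = {fast:6.3f}")
```

With decay exactly $j^{-1}$ the trace grows like $\log d$, so no dimension-free bound $R$ can hold, whereas the slightly faster decay keeps the trace bounded uniformly in $d$.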


1.3.3 Structured forms of regression 


As discussed in Section 1.2.3, a generic regression problem in high dimensions suffers from 
a severe curse of dimensionality. What type of structure can alleviate this curse? There are
various forms of low-dimensional structure that have been studied in past and on-going work
on high-dimensional regression. 

One form of structure is that of an additive decomposition in the regression function, say
of the form

$$f(x_1, \ldots, x_d) = \sum_{j=1}^{d} g_j(x_j), \tag{1.20}$$

where each univariate function $g_j \colon \mathbb{R} \to \mathbb{R}$ is chosen from some base class. For such
functions, the problem of regression is reduced to estimating a collection of $d$ separate
univariate functions. The general theory developed in Chapters 13 and 14 can be used to show
how the additive assumption (1.20) largely circumvents³ the curse of dimensionality. A very
special case of the additive decomposition (1.20) is the classical linear model, in which, for
each $j = 1, \ldots, d$, the univariate function takes the form $g_j(x_j) = \theta_j x_j$ for some coefficients
$\theta_j \in \mathbb{R}$. More generally, we might assume that each $g_j$ belongs to a reproducing kernel
Hilbert space, a class of function spaces studied at length in Chapter 12.

Assumptions of sparsity also play an important role in the regression setting. The sparse
additive model (SPAM) is based on positing the existence of some subset $S \subset \{1, 2, \ldots, d\}$
of cardinality $s = |S|$ such that the regression function can be decomposed as

$$f(x_1, \ldots, x_d) = \sum_{j \in S} g_j(x_j). \tag{1.21}$$

In this model, there are two different classes of objects to be estimated: (i) the unknown
subset $S$ that ranges over all $\binom{d}{s}$ possible subsets of size $s$; and (ii) the univariate functions
$\{g_j, j \in S\}$ associated with this subset. A special case of the SPAM decomposition (1.21) is
the sparse linear model, in which $f(x) = \sum_{j=1}^{d} \theta_j x_j$ for some vector $\theta \in \mathbb{R}^d$ that is $s$-sparse.
See Chapter 7 for a detailed discussion of this class of models, and the conditions under
which accurate estimation is possible even when $d > n$.
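As a small illustration of the SPAM decomposition (1.21), the sketch below evaluates a sparse additive function and counts the candidate subsets. The helper `spam_predict`, the component functions and the index set are hypothetical choices made for this example; indices are 0-based in the code.

```python
import math
from typing import Callable, Dict, Sequence


def spam_predict(x: Sequence[float],
                 components: Dict[int, Callable[[float], float]]) -> float:
    """Evaluate f(x) = sum_{j in S} g_j(x_j), where S is the key set of
    `components` and each value is the univariate function g_j."""
    return sum(g(x[j]) for j, g in components.items())


d = 1000
# Hypothetical active set S = {3, 17, 512} with s = 3 univariate components.
components = {3: math.sin, 17: lambda u: u ** 2, 512: lambda u: -2.0 * u}

x = [0.0] * d
x[3], x[17], x[512] = math.pi / 2, 3.0, 1.5
x[0] = 99.0  # coordinate outside S: it has no effect on f(x)

value = spam_predict(x, components)          # sin(pi/2) + 3^2 - 2 * 1.5
num_subsets = math.comb(d, len(components))  # number of subsets of size s

# Sparse linear model as a special case: g_j(u) = theta_j * u for j in S.
theta = {3: 0.5, 17: -1.0}
linear_value = spam_predict(x, {j: (lambda u, c=c: c * u) for j, c in theta.items()})
```

Even in this small setting there are $\binom{1000}{3}$ candidate subsets, which indicates why recovering $S$ is a non-trivial part of the estimation problem.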
There are a variety of other types of structured regression models to which the methods
and theory developed in this book can be applied. Examples include the multiple-index
model, in which the regression function takes the form

$$f(x_1, \ldots, x_d) = h(Ax), \tag{1.22}$$

for some matrix $A \in \mathbb{R}^{s \times d}$, and function $h \colon \mathbb{R}^s \to \mathbb{R}$. The single-index model is the special
case of this model with $s = 1$, so that $f(x) = h(\langle a, x \rangle)$ for some vector $a \in \mathbb{R}^d$. Another
special case of this more general family is the SPAM class (1.21): it can be obtained by
letting the rows of $A$ be the standard basis vectors $\{e_j, j \in S\}$, and letting the function $h$
belong to the additive class (1.20).
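The two special cases of the multiple-index model (1.22) can be sketched in a few lines. The function `multi_index`, the link functions and all numerical values below are hypothetical and serve only to illustrate the structure $f(x) = h(Ax)$; indices are 0-based in the code.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def multi_index(x: Vector, A: List[Vector],
                h: Callable[[List[float]], float]) -> float:
    """Evaluate f(x) = h(Ax) for an s-by-d matrix A given as a list of rows."""
    projections = [sum(a_k * x_k for a_k, x_k in zip(row, x)) for row in A]
    return h(projections)


d = 5
x = [1.0, -2.0, 0.5, 3.0, 4.0]

# Single-index model (s = 1): f(x) = h(<a, x>) with a hypothetical link tanh.
a = [0.5, 0.0, 2.0, 0.0, -0.25]
single = multi_index(x, [a], lambda z: math.tanh(z[0]))

# SPAM as a special case: the rows of A are the basis vectors e_j for j in S,
# and h is additive, here with g_1(u) = |u| and g_3(u) = u^2.
S = [1, 3]
basis_rows = [[1.0 if k == j else 0.0 for k in range(d)] for j in S]
spam = multi_index(x, basis_rows, lambda z: abs(z[0]) + z[1] ** 2)
```

In the second case the matrix $A$ simply selects the coordinates in $S$, so that $h(Ax)$ depends on $x$ only through $(x_j, j \in S)$.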

Taking sums of single-index models leads to a method known as projection pursuit
regression, involving functions of the form

$$f(x_1, \ldots, x_d) = \sum_{j=1}^{M} g_j(\langle a_j, x \rangle), \tag{1.23}$$

for some collection of univariate functions $\{g_j\}_{j=1}^{M}$, and a collection of $d$-vectors $\{a_j\}_{j=1}^{M}$. Such
models can also help alleviate the curse of dimensionality, as long as the number of terms
$M$ can be kept relatively small while retaining a good fit to the regression function.

³ In particular, see Exercise 13.9, as well as Examples 14.11 and 14.14.


1.4 What is the non-asymptotic viewpoint? 


As indicated by its title, this book emphasizes non-asymptotic results in high-dimensional 
statistics. In order to put this emphasis in context, we can distinguish between at least three 
types of statistical analysis, depending on how the sample size behaves relative to the di- 
mension and other problem parameters: 


• Classical asymptotics. The sample size $n$ is taken to infinity, with the dimension $d$ and
all other problem parameters remaining fixed. The standard laws of large numbers and
central limit theorem are examples of this type of theory.

• High-dimensional asymptotics. The pair $(n, d)$ is taken to infinity simultaneously, while
enforcing that, for some scaling function $\Psi$, the sequence $\Psi(n, d)$ remains fixed, or
converges to some value $\alpha \in [0, \infty]$. For example, in our discussions of linear discriminant
analysis (Section 1.2.1) and covariance estimation (Section 1.2.2), we considered
such scalings with the function $\Psi(n, d) = d/n$. More generally, the scaling function
might depend on other problem parameters in addition to $(n, d)$. For example, in studying
vector estimation problems involving a sparsity parameter $s$, the scaling function
$\Psi(n, d, s) = \log \binom{d}{s} / n$ might be used. Here the numerator reflects that there are $\binom{d}{s}$ possible
subsets of cardinality $s$ contained in the set of all possible indices $\{1, 2, \ldots, d\}$.

• Non-asymptotic bounds. The pair $(n, d)$, as well as other problem parameters, are viewed
as fixed, and high-probability statements are made as a function of them. The previously
stated bound (1.11) on the maximum eigenvalue of a sample covariance matrix is a standard
example of such a result. Results of this type (that is, tail bounds and concentration
inequalities on the performance of statistical estimators) are the primary focus of this
book.
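The two scaling functions mentioned in the list above are easy to evaluate numerically. In the sketch below, the function names and the particular triples $(n, d, s)$ are our own illustrative choices.

```python
import math


def psi_aspect_ratio(n: int, d: int) -> float:
    """Scaling function Psi(n, d) = d / n."""
    return d / n


def psi_sparse(n: int, d: int, s: int) -> float:
    """Scaling function Psi(n, d, s) = log C(d, s) / n, where C(d, s) counts
    the subsets of size s of {1, ..., d}."""
    return math.log(math.comb(d, s)) / n


# The setting (n, d) = (4000, 800) used in Figure 1.2(a).
alpha = psi_aspect_ratio(4000, 800)

# Hypothetical problem sizes in which n, d and s all grow by factors of ten.
settings = [(200, 1_000, 5), (2_000, 10_000, 50), (20_000, 100_000, 500)]
sparse_values = [psi_sparse(n, d, s) for n, d, s in settings]

for (n, d, s), value in zip(settings, sparse_values):
    print(f"n = {n:>6}, d = {d:>7}, s = {s:>4}: log C(d, s) / n = {value:.4f}")
```

Along this sequence the triple $(n, d, s)$ diverges while $\Psi(n, d, s)$ stays of constant order, which is the kind of regime a high-dimensional asymptotic analysis would study.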


To be clear, these modes of analysis are closely related. Tail bounds and concentration
inequalities typically underlie the proofs of classical asymptotic theorems, such as almost
sure convergence of a sequence of random variables. Non-asymptotic theory can be used
to predict some aspects of high-dimensional asymptotic phenomena; for instance, it can
be used to derive the limiting forms of the error probabilities (1.6) for linear discriminant
analysis. In random matrix theory, it can be used to establish that the sample eigenspectrum
of a sample covariance matrix with $d/n = \alpha$ lies within⁴ the interval $[(1 - \sqrt{\alpha})^2, (1 + \sqrt{\alpha})^2]$
with probability one as $(n, d)$ grow (cf. Figure 1.2). Finally, the functions that arise in a
non-asymptotic analysis can suggest appropriate forms of scaling functions $\Psi$ suitable for
performing a high-dimensional asymptotic analysis so as to unveil limiting distributional
behavior.

One topic not covered in this book (due to space constraints) is an evolving line of
work that seeks to characterize the asymptotic behavior of low-dimensional functions of a
given high-dimensional estimator; see the bibliography in Section 1.6 for some references.
For instance, in sparse vector estimation, one natural goal is to seek a confidence interval
for a given coordinate of the $d$-dimensional vector. At the heart of such analyses are
non-asymptotic tail bounds, which allow for control of residuals within the asymptotics.
Consequently, the reader who has mastered the techniques laid out in this book will be well
equipped to follow these types of derivations.

⁴ To be clear, it does not predict the precise shape of the distribution on this interval, as given by the
Marčenko–Pastur law.


1.5 Overview of the book 


With this motivation in hand, let us now turn to a broad overview of the structure of this 
book, as well as some suggestions regarding its potential use in a teaching context. 


1.5.1 Chapter structure and synopses 


The chapters follow a rough division into two types: material on Tools and techniques (TT), 
and material on Models and estimators (ME). Chapters of the TT type are foundational in 
nature, meant to develop techniques and derive theory that is broadly applicable in high- 
dimensional statistics. The ME chapters are meant to be complementary in nature: each 
such chapter focuses on a particular class of statistical estimation problems, and brings to 
bear the methods developed in the foundational chapters. 


Tools and techniques 


• Chapter 2: This chapter provides an introduction to standard techniques in deriving tail
bounds and concentration inequalities. It is required reading for all other chapters in the
book.

• Chapter 3: Following directly from Chapter 2, this chapter is devoted to more advanced
material on concentration of measure, including the entropic method, log-Sobolev
inequalities, and transportation cost inequalities. It is meant for the reader interested in a
deeper understanding of the concentration phenomenon, but is not required reading for
the remaining chapters. The concentration inequalities in Section 3.4 for empirical
processes are used in later analysis of nonparametric models.

• Chapter 4: This chapter is again required reading for most other chapters, as it introduces
the foundational ideas of uniform laws of large numbers, along with techniques such as
symmetrization, which leads naturally to the Rademacher complexity of a set. It also
covers the notion of Vapnik–Chervonenkis (VC) dimension as a particular way of bounding
the Rademacher complexity.

• Chapter 5: This chapter introduces the geometric notions of covering and packing in
metric spaces, along with the associated discretization and chaining arguments that underlie
proofs of uniform laws via entropic arguments. These arguments, including Dudley’s
entropy integral, are required for later study of nonparametric models in Chapters 13 and 14.
Also covered in this chapter are various connections to Gaussian processes, including the
Sudakov–Fernique and Gordon–Slepian bounds, as well as Sudakov’s lower bound.

• Chapter 12: This chapter provides a self-contained introduction to reproducing kernel
Hilbert spaces, including material on kernel functions, Mercer’s theorem and eigenvalues,
the representer theorem, and applications to function interpolation and estimation via
kernel ridge regression. This material is not a prerequisite for reading Chapters 13 and 14,
but is required for understanding the kernel-based examples covered in these chapters on
nonparametric problems.

• Chapter 14: This chapter follows the material from Chapters 4 and 13, and is devoted to
more advanced material on uniform laws, including an in-depth analysis of two-sided and
one-sided uniform laws for the population and empirical $L^2$-norms. It also includes some
extensions to certain Lipschitz cost functions, along with applications to nonparametric
density estimation.

• Chapter 15: This chapter provides a self-contained introduction to techniques for proving
minimax lower bounds, including in-depth discussions of Le Cam’s method in both its
naive and general forms, the local and Yang–Barron versions of the Fano method, along
with various examples. It can be read independently of any other chapter, but does make
reference (for comparison) to upper bounds proved in other chapters.


Models and estimators 


• Chapter 6: This chapter is devoted to the problem of covariance estimation. It develops
various non-asymptotic bounds for the singular values and operator norms of random
matrices, using methods based on comparison inequalities for Gaussian matrices,
discretization methods for sub-Gaussian and sub-exponential variables, as well as tail bounds
of the Ahlswede–Winter type. It also covers the estimation of sparse and structured covariance
matrices via thresholding and related techniques. Material from Chapters 2, 4 and 5 is
needed for a full understanding of the proofs in this chapter.

• Chapter 7: The sparse linear model is possibly the most widely studied instance of a
high-dimensional statistical model, and arises in various applications. This chapter is
devoted to theoretical results on the behavior of $\ell_1$-relaxations for estimating sparse vectors,
including results on exact recovery for noiseless models, estimation in $\ell_2$-norm and
prediction semi-norms for noisy models, as well as results on variable selection. It makes
substantial use of various tail bounds from Chapter 2.

• Chapter 8: Principal component analysis is a standard method in multivariate data
analysis, and exhibits a number of interesting phenomena in the high-dimensional setting. This
chapter is devoted to a non-asymptotic study of its properties, in both its unstructured and
sparse versions. The underlying analysis makes use of techniques from Chapters 2 and 6.

• Chapter 9: This chapter develops general techniques for analyzing estimators that are
based on decomposable regularizers, including the $\ell_1$-norm and nuclear norm as special
cases. It builds on the material on sparse linear regression from Chapter 7, and makes use
of techniques from Chapters 2 and 4.

• Chapter 10: There are various applications that involve the estimation of low-rank
matrices in high dimensions, and this chapter is devoted to estimators based on replacing the
rank constraint with a nuclear norm penalty. It makes direct use of the framework from
Chapter 9, as well as tail bounds and random matrix theory from Chapters 2 and 6.

• Chapter 11: Graphical models combine ideas from probability theory and graph theory,
and are widely used in modeling high-dimensional data. This chapter addresses various
types of estimation and model selection problems that arise in graphical models. It
requires background from Chapters 2 and 7.

• Chapter 13: This chapter is devoted to an in-depth analysis of least-squares estimation
in the general nonparametric setting, with a broad range of examples. It exploits
techniques from Chapters 2, 4 and 5, along with some concentration inequalities for empirical
processes from Chapter 3.


1.5.2 Recommended background 


This book is targeted at graduate students with an interest in applied mathematics broadly 
defined, including mathematically oriented branches of statistics, computer science, electri- 
cal engineering and econometrics. As such, it assumes a strong undergraduate background 
in basic aspects of mathematics, including the following: 


• A course in linear algebra, including material on matrices, eigenvalues and
eigendecompositions, singular values, and so on.

• A course in basic real analysis, at the level of Rudin’s elementary book (Rudin, 1964),
covering convergence of sequences and series, metric spaces and abstract integration.

• A course in probability theory, including both discrete and continuous variables, laws of
large numbers, as well as central limit theory. A measure-theoretic version is not required,
but the ability to deal with abstraction of this type is useful. Some useful books include
Breiman (1992), Chung (1991), Durrett (2010) and Williams (1991).

• A course in classical mathematical statistics, including some background on decision
theory, basics of estimation and testing, maximum likelihood estimation and some
asymptotic theory. Some standard books at the appropriate level include Keener (2010), Bickel
and Doksum (2015) and Shao (2007).


Probably the most subtle requirement is a certain degree of mathematical maturity on the 
part of the reader. This book is meant for the person who is interested in gaining a deep un- 
derstanding of the core issues in high-dimensional statistics. As with anything worthwhile in 
life, doing so requires effort. This basic fact should be kept in mind while working through 
the proofs, examples and exercises in this book. 


At the same time, this book has been written with self-study and/or teaching in mind. To wit, 
we have often sacrificed generality or sharpness in theorem statements for the sake of proof 
clarity. In lieu of an exhaustive treatment, our primary emphasis is on developing techniques 
that can be used to analyze many different problems. To this end, each chapter is seeded 
with a large number of examples, in which we derive specific consequences of more abstract 
statements. Working through these examples in detail, as well as through some of the many 
exercises at the end of each chapter, is the best way to gain a robust grasp of the material. 
As a warning to the reader: the exercises range in difficulty from relatively straightforward 
to extremely challenging. Don’t be discouraged if you find an exercise to be challenging; 
some of them are meant to be! 


1.5.3 Teaching possibilities and a flow diagram 


This book has been used for teaching one-semester graduate courses on high-dimensional
statistics at various universities, including the University of California Berkeley, Carnegie
Mellon University, Massachusetts Institute of Technology and Yale University. The book
has far too much material for a one-semester class, but there are various ways of working
through different subsets of chapters over time periods ranging from five to 15 weeks. See
Figure 1.6 for a flow diagram that illustrates some of these different pathways through the
book.




Figure 1.6 A flow diagram of Chapters 2-15 and some of their dependence struc- 
ture. Various tours of subsets of chapters are possible; see the text for more details. 


A short introduction. Given a shorter period of a few weeks, it would be reasonable to cover 
Chapter 2 followed by Chapter 7 on sparse linear regression, followed by parts of Chapter 6 
on covariance estimation. Other brief tours beginning with Chapter 2 are also possible. 


A longer look. Given a few more weeks, a longer look could be obtained by supplementing 
the short introduction with some material from Chapter 5 on metric entropy and Dudley’s 
entropy integral, followed by Chapter 13 on nonparametric least squares. This supplement 
would give a taste of the nonparametric material in the book. Alternative additions are pos- 
sible, depending on interests. 


A full semester course. A semester-length tour through the book could include Chapter 2 on
tail bounds, Chapter 4 on uniform laws, the material in Sections 5.1 through 5.3.3 on metric
entropy through to Dudley’s entropy integral, followed by parts of Chapter 6 on covariance
estimation, Chapter 7 on sparse linear regression, and Chapter 8 on principal component
analysis. A second component of the course could consist of Chapter 12 on reproducing
kernel Hilbert spaces, followed by Chapter 13 on nonparametric least squares. Depending
on the semester length, it could also be possible to cover some material on minimax lower
bounds from Chapter 15.


1.6 Bibliographic details and background 


Rao (1949) was one of the first authors to consider high-dimensional effects in two-sample 
testing problems. The high-dimensional linear discriminant problem discussed in Section 
1.2.1 was first proposed and analyzed by Kolmogorov in the 1960s. Deev, working in the 
group of Kolmogorov, analyzed the high-dimensional asymptotics of the general Fisher
linear discriminant for fractions $\alpha \in [0, 1)$. See the book by Serdobolskii (2000) and the survey
paper by Raudys and Young (2004) for further detail on this early line of Russian research 
in high-dimensional classification. 

The study of high-dimensional random matrices, as treated briefly in Section 1.2.2, also
has deep roots, dating back to the seminal work from the 1950s onwards (e.g., Wigner, 1955,
1958; Marčenko and Pastur, 1967; Pastur, 1972; Wachter, 1978; Geman, 1980). The
high-dimensional asymptotic law for the eigenvalues of a sample covariance matrix illustrated in
Figure 1.2 is due to Marčenko and Pastur (1967); this asymptotic prediction has been shown
to be a remarkably robust phenomenon, requiring only mild moment conditions (e.g.,
Silverstein, 1995; Bai and Silverstein, 2010). See also the paper by Götze and Tikhomirov (2004)
for quantitative bounds on the distance to this limiting distribution.

In his Wald Memorial Lecture, Huber (1973) studied the asymptotics of robust regres- 
sion under a high-dimensional scaling with d/n constant. Portnoy (1984; 1985) studied
M-estimators for high-dimensional linear regression models, proving consistency when the
ratio $\frac{d \log d}{n}$ goes to zero, and asymptotic normality under somewhat more stringent conditions.
See also Portnoy (1988) for extensions to more general exponential family models. The 
high-dimensional asymptotics of various forms of robust regression estimators have been 
studied in recent work by El Karoui and co-authors (e.g., Bean et al., 2013; El Karoui, 2013; 
El Karoui et al., 2013), as well as by Donoho and Montanari (2013). 

Thresholding estimators are widely used in statistical problems in which the estimand 
is expected to be sparse. See the book by Johnstone (2015) for an extensive discussion of 
thresholding estimators in the context of the normal sequence model, with various appli- 
cations in nonparametric estimation and density estimation. See also Chapters 6 and 7 for 
some discussion and analysis of thresholding estimators. Soft thresholding is very closely
related to $\ell_1$-regularization, a method with a lengthy history (e.g., Levy and Fullagar, 1981;
Santosa and Symes, 1986; Tibshirani, 1996; Chen et al., 1998; Juditsky and Nemirovski,
2000; Donoho and Huo, 2001; Elad and Bruckstein, 2002; Candès and Tao, 2005; Donoho,
2006b; Bickel et al., 2009); see Chapter 7 for an in-depth discussion.

Stone (1985) introduced the class of additive models (1.20) for nonparametric regression; 
see the book by Hastie and Tibshirani (1990) for more details. The SPAM class (1.21) has 
been studied by many researchers (e.g., Meier et al., 2009; Ravikumar et al., 2009; Koltchin- 
skii and Yuan, 2010; Raskutti et al., 2012). The single-index model (1.22), as a particular 
instance of a semiparametric model, has also been widely studied; for instance, see the
various papers (Härdle and Stoker, 1989; Härdle et al., 1993; Ichimura, 1993; Hristache et al.,
2001) and references therein for further details. Friedman and Stuetzle (1981) introduced the 
idea of projection pursuit regression (1.23). In broad terms, projection pursuit methods are 
based on seeking “interesting” projections of high-dimensional data (Kruskal, 1969; Huber, 
1985; Friedman and Tukey, 1994), and projection pursuit regression is based on this idea in 
the context of regression. 
Basic tail and concentration bounds


In a variety of settings, it is of interest to obtain bounds on the tails of a random variable, or 
two-sided inequalities that guarantee that a random variable is close to its mean or median. In 
this chapter, we explore a number of elementary techniques for obtaining both deviation and 
concentration inequalities. This chapter serves as an entry point to more advanced literature 
on large-deviation bounds and concentration of measure. 


2.1 Classical bounds 


One way in which to control a tail probability P[X > t] is by controlling the moments of 
the random variable X. Gaining control of higher-order moments leads to correspondingly 
sharper bounds on tail probabilities, ranging from Markov’s inequality (which requires only 
existence of the first moment) to the Chernoff bound (which requires existence of the mo- 
ment generating function). 


2.1.1 From Markov to Chernoff 


The most elementary tail bound is Markov’s inequality: given a non-negative random
variable $X$ with finite mean, we have

$$\mathbb{P}[X \geq t] \leq \frac{\mathbb{E}[X]}{t} \quad \text{for all } t > 0. \tag{2.1}$$

This is a simple instance of an upper tail bound. For a random variable $X$ that also has a
finite variance, we have Chebyshev’s inequality:

$$\mathbb{P}[|X - \mu| \geq t] \leq \frac{\operatorname{var}(X)}{t^2} \quad \text{for all } t > 0. \tag{2.2}$$

This is a simple form of concentration inequality, guaranteeing that $X$ is close to its mean
$\mu = \mathbb{E}[X]$ whenever its variance is small. Observe that Chebyshev’s inequality follows
by applying Markov’s inequality to the non-negative random variable $Y = (X - \mu)^2$. Both
Markov’s and Chebyshev’s inequalities are sharp, meaning that they cannot be improved in
general (see Exercise 2.1).
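The sharpness claim can be checked on small discrete distributions. The two constructions below are standard examples chosen by us for illustration (they are not necessarily the ones intended in Exercise 2.1), and the helper names are hypothetical; exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction


def markov_extremal(mean: Fraction, t: Fraction) -> dict:
    """Two-point law on {0, t} with P[X = t] = mean / t (needs 0 < mean <= t)."""
    p = mean / t
    return {Fraction(0): 1 - p, t: p}


def chebyshev_extremal(variance: Fraction, t: Fraction) -> dict:
    """Symmetric three-point law on {-t, 0, t} with the given variance
    (needs 0 < variance <= t^2); its mean is zero."""
    p = variance / t ** 2
    return {-t: p / 2, Fraction(0): 1 - p, t: p / 2}


def expectation(law: dict, f=lambda x: x) -> Fraction:
    return sum(prob * f(x) for x, prob in law.items())


def probability(law: dict, event) -> Fraction:
    return sum(prob for x, prob in law.items() if event(x))


# Markov: P[X >= t] equals E[X] / t exactly.
mean, t = Fraction(1, 2), Fraction(2)
law_m = markov_extremal(mean, t)
markov_lhs = probability(law_m, lambda x: x >= t)
markov_rhs = expectation(law_m) / t

# Chebyshev: P[|X - mu| >= t] equals var(X) / t^2 exactly (here mu = 0).
variance, t2 = Fraction(1), Fraction(3)
law_c = chebyshev_extremal(variance, t2)
cheb_lhs = probability(law_c, lambda x: abs(x) >= t2)
cheb_rhs = expectation(law_c, lambda x: x ** 2) / t2 ** 2
```

In both cases the left-hand and right-hand sides coincide, so neither inequality can be improved without further assumptions on $X$.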

There are various extensions of Markov’s inequality applicable to random variables with
higher-order moments. For instance, whenever $X$ has a central moment of order $k$, an
application of Markov’s inequality to the random variable $|X - \mu|^k$ yields that

$$\mathbb{P}[|X - \mu| \geq t] \leq \frac{\mathbb{E}[|X - \mu|^k]}{t^k} \quad \text{for all } t > 0. \tag{2.3}$$

Of course, the same procedure can be applied to functions other than polynomials $|X - \mu|^k$.
For instance, suppose that the random variable $X$ has a moment generating function in a
neighborhood of zero, meaning that there is some constant $b > 0$ such that the function
$\varphi(\lambda) = \mathbb{E}[e^{\lambda(X - \mu)}]$ exists for all $|\lambda| \leq b$. In this case, for any $\lambda \in [0, b]$, we may apply
Markov’s inequality to the random variable $Y = e^{\lambda(X - \mu)}$, thereby obtaining the upper bound

$$\mathbb{P}[(X - \mu) \geq t] = \mathbb{P}[e^{\lambda(X - \mu)} \geq e^{\lambda t}] \leq \frac{\mathbb{E}[e^{\lambda(X - \mu)}]}{e^{\lambda t}}. \tag{2.4}$$

Optimizing our choice of $\lambda$ so as to obtain the tightest result yields the Chernoff bound,
namely, the inequality

$$\log \mathbb{P}[(X - \mu) \geq t] \leq \inf_{\lambda \in [0, b]} \big\{ \log \mathbb{E}[e^{\lambda(X - \mu)}] - \lambda t \big\}. \tag{2.5}$$


As we explore in Exercise 2.3, the moment bound (2.3) with an optimal choice of k is 
never worse than the bound (2.5) based on the moment generating function. Nonetheless, 
the Chernoff bound is most widely used in practice, possibly due to the ease of manipulating 
moment generating functions. Indeed, a variety of important tail bounds can be obtained as 
particular cases of inequality (2.5), as we discuss in examples to follow. 
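The comparison between moment bounds and the Chernoff bound can be explored numerically. The sketch below is an illustration under our own assumptions: it takes $X$ to be a standard exponential variable, for which $\mathbb{E}[X^k] = k!$ and $\mathbb{E}[e^{\lambda X}] = 1/(1 - \lambda)$ for $\lambda < 1$, and, since $X \geq 0$, it applies Markov’s inequality to $X^k$ and $e^{\lambda X}$ directly rather than to the centered variable. The function names are hypothetical.

```python
import math


def moment_bound(delta: float, max_order: int = 60) -> float:
    """min over k = 0, ..., max_order of E[X^k] / delta^k for X ~ Exp(1)."""
    return min(math.factorial(k) / delta ** k for k in range(max_order + 1))


def chernoff_bound(delta: float, grid_size: int = 10_000) -> float:
    """min over a grid of lambda in [0, 1) of E[exp(lambda X)] * exp(-lambda delta),
    using the Exp(1) moment generating function 1 / (1 - lambda)."""
    lams = (i / grid_size for i in range(grid_size))
    return min(math.exp(-lam * delta) / (1.0 - lam) for lam in lams)


deltas = (2.0, 5.0, 10.0)
for delta in deltas:
    exact = math.exp(-delta)  # P[X >= delta] for X ~ Exp(1)
    print(f"delta = {delta:4.1f}: exact = {exact:.3e}, "
          f"moment = {moment_bound(delta):.3e}, "
          f"Chernoff = {chernoff_bound(delta):.3e}")
```

For each threshold the optimized moment bound sits between the exact tail probability and the Chernoff bound, in line with the ordering described above.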


2.1.2 Sub-Gaussian variables and Hoeffding bounds 


The form of tail bound obtained via the Chernoff approach depends on the growth rate of the 
moment generating function. Accordingly, in the study of tail bounds, it is natural to classify 
random variables in terms of their moment generating functions. For reasons to become clear 
in the sequel, the simplest type of behavior is known as sub-Gaussian. In order to motivate 
this notion, let us illustrate the use of the Chernoff bound (2.5) in deriving tail bounds for a 
Gaussian variable. 


Example 2.1 (Gaussian tail bounds) Let $X \sim \mathcal{N}(\mu, \sigma^2)$ be a Gaussian random variable with
mean $\mu$ and variance $\sigma^2$. By a straightforward calculation, we find that $X$ has the moment
generating function

$$\mathbb{E}[e^{\lambda X}] = e^{\mu\lambda + \frac{\sigma^2 \lambda^2}{2}} \quad \text{valid for all } \lambda \in \mathbb{R}. \tag{2.6}$$

Substituting this expression into the optimization problem defining the optimized Chernoff
bound (2.5), we obtain

$$\inf_{\lambda \geq 0} \big\{ \log \mathbb{E}[e^{\lambda(X - \mu)}] - \lambda t \big\} = \inf_{\lambda \geq 0} \Big\{ \frac{\lambda^2 \sigma^2}{2} - \lambda t \Big\} = -\frac{t^2}{2\sigma^2},$$

where we have taken derivatives in order to find the optimum of this quadratic function.
Returning to the Chernoff bound (2.5), we conclude that any $\mathcal{N}(\mu, \sigma^2)$ random variable satisfies
the upper deviation inequality

$$\mathbb{P}[X \geq \mu + t] \leq e^{-\frac{t^2}{2\sigma^2}} \quad \text{for all } t \geq 0. \tag{2.7}$$

In fact, this bound is sharp up to polynomial-factor corrections, as shown by our exploration
of the Mills ratio in Exercise 2.2. ♣
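As a numerical companion to Example 2.1, the following sketch compares the bound (2.7) with the exact Gaussian tail, and recovers the optimized exponent by a crude grid search. The value $\sigma = 1.5$, the thresholds and the grid are arbitrary illustrative choices, and the function names are ours.

```python
import math


def gaussian_upper_tail(t: float, sigma: float) -> float:
    """Exact P[X >= mu + t] for X ~ N(mu, sigma^2), via the complementary
    error function."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))


def chernoff_tail_bound(t: float, sigma: float) -> float:
    """Right-hand side of (2.7): exp(-t^2 / (2 sigma^2))."""
    return math.exp(-t ** 2 / (2.0 * sigma ** 2))


def chernoff_exponent_on_grid(t: float, sigma: float,
                              lam_max: float = 10.0,
                              grid_size: int = 20_000) -> float:
    """Minimize sigma^2 lambda^2 / 2 - lambda t over a grid in [0, lam_max]."""
    lams = (lam_max * i / grid_size for i in range(grid_size + 1))
    return min(0.5 * (sigma * lam) ** 2 - lam * t for lam in lams)


sigma = 1.5
ts = [0.5, 1.0, 2.0, 4.0]
for t in ts:
    print(f"t = {t:3.1f}: exact tail = {gaussian_upper_tail(t, sigma):.4e}, "
          f"bound (2.7) = {chernoff_tail_bound(t, sigma):.4e}, "
          f"grid exponent = {chernoff_exponent_on_grid(t, sigma):.6f}")
```

The grid search reproduces the closed-form exponent $-t^2/(2\sigma^2)$ to within the grid resolution, and the exact tail lies below the bound at every threshold.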
Motivated by the structure of this example, we are led to introduce the following definition.


Definition 2.2 A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-Gaussian if there is
a positive number $\sigma$ such that

$$\mathbb{E}[e^{\lambda(X - \mu)}] \leq e^{\frac{\sigma^2 \lambda^2}{2}} \quad \text{for all } \lambda \in \mathbb{R}. \tag{2.8}$$


The constant $\sigma$ is referred to as the sub-Gaussian parameter; for instance, we say that $X$
is sub-Gaussian with parameter $\sigma$ when the condition (2.8) holds. Naturally, any Gaussian
variable with variance $\sigma^2$ is sub-Gaussian with parameter $\sigma$, as should be clear from the
calculation described in Example 2.1. In addition, as we will see in the examples and exercises
to follow, a large number of non-Gaussian random variables also satisfy the condition (2.8).


The condition (2.8), when combined with the Chernoff bound as in Example 2.1, shows
that, if $X$ is sub-Gaussian with parameter $\sigma$, then it satisfies the upper deviation
inequality (2.7). Moreover, by the symmetry of the definition, the variable $-X$ is sub-Gaussian
if and only if $X$ is sub-Gaussian, so that we also have the lower deviation inequality
$\mathbb{P}[X \leq \mu - t] \leq e^{-\frac{t^2}{2\sigma^2}}$, valid for all $t \geq 0$. Combining the pieces, we conclude that any
sub-Gaussian variable satisfies the concentration inequality

$$\mathbb{P}[|X - \mu| \geq t] \leq 2 e^{-\frac{t^2}{2\sigma^2}} \quad \text{for all } t \in \mathbb{R}. \tag{2.9}$$


Let us consider some examples of sub-Gaussian variables that are non-Gaussian. 


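Example 2.3 (Rademacher variables) A Rademacher random variable $\varepsilon$ takes the values
$\{-1, +1\}$ equiprobably. We claim that it is sub-Gaussian with parameter $\sigma = 1$. By taking
expectations and using the power-series expansion for the exponential, we obtain

$$\mathbb{E}[e^{\lambda \varepsilon}] = \frac{1}{2}\big\{e^{-\lambda} + e^{\lambda}\big\} = \frac{1}{2}\Big\{\sum_{k=0}^{\infty} \frac{(-\lambda)^k}{k!} + \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\Big\} = \sum_{k=0}^{\infty} \frac{\lambda^{2k}}{(2k)!} \leq 1 + \sum_{k=1}^{\infty} \frac{\lambda^{2k}}{2^k \, k!} = e^{\lambda^2/2},$$

which shows that $\varepsilon$ is sub-Gaussian with parameter $\sigma = 1$ as claimed. ♣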
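Both steps of this calculation can be checked numerically: the coefficient comparison $(2k)! \geq 2^k k!$ that justifies the inequality, and the resulting bound on the moment generating function. The grid of values of $\lambda$ and the function names below are arbitrary choices made for this sketch.

```python
import math


def rademacher_mgf(lam: float) -> float:
    """E[exp(lam * eps)] for a Rademacher variable eps, which equals cosh(lam)."""
    return 0.5 * (math.exp(-lam) + math.exp(lam))


def sub_gaussian_mgf_bound(lam: float, sigma: float = 1.0) -> float:
    """Right-hand side of (2.8): exp(sigma^2 lam^2 / 2)."""
    return math.exp(0.5 * (sigma * lam) ** 2)


# Term-by-term comparison of the two power series: (2k)! >= 2^k k!.
coefficients_ok = all(
    math.factorial(2 * k) >= 2 ** k * math.factorial(k) for k in range(30)
)

# Direct comparison of the two sides on a grid of lambda in [-5, 5].
lams = [k / 10 for k in range(-50, 51)]
mgf_ok = all(rademacher_mgf(lam) <= sub_gaussian_mgf_bound(lam) for lam in lams)

print(coefficients_ok, mgf_ok)
```

A check on a finite grid is of course no substitute for the series argument, but it is a quick way to catch an error in the claimed parameter.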


We now generalize the preceding example to show that any bounded random variable is also
sub-Gaussian.

Example 2.4 (Bounded random variables) Let $X$ be zero-mean, and supported on some
interval $[a, b]$. Letting $X'$ be an independent copy, for any $\lambda \in \mathbb{R}$, we have

$$\mathbb{E}_X[e^{\lambda X}] = \mathbb{E}_X\big[e^{\lambda(X - \mathbb{E}_{X'}[X'])}\big] \leq \mathbb{E}_{X, X'}\big[e^{\lambda(X - X')}\big],$$

where the inequality follows from the convexity of the exponential, and Jensen’s inequality.
Letting $\varepsilon$ be an independent Rademacher variable, note that the distribution of $(X - X')$ is
the same as that of $\varepsilon(X - X')$, so that we have

$$\mathbb{E}_{X, X'}\big[e^{\lambda(X - X')}\big] = \mathbb{E}_{X, X'}\big[\mathbb{E}_{\varepsilon}[e^{\lambda \varepsilon (X - X')}]\big] \overset{(i)}{\leq} \mathbb{E}_{X, X'}\big[e^{\frac{\lambda^2 (X - X')^2}{2}}\big],$$

where step (i) follows from the result of Example 2.3, applied conditionally with $(X, X')$ held
fixed. Since $|X - X'| \leq b - a$, we are guaranteed that

$$\mathbb{E}_{X, X'}\big[e^{\frac{\lambda^2 (X - X')^2}{2}}\big] \leq e^{\frac{\lambda^2 (b - a)^2}{2}}.$$

Putting together the pieces, we have shown that $X$ is sub-Gaussian with parameter at most
$\sigma = b - a$. This result is useful but can be sharpened. In Exercise 2.4, we work through a
more involved argument to show that $X$ is sub-Gaussian with parameter at most $\sigma = \frac{b - a}{2}$. ♣


Remark: The technique used in Example 2.4 is a simple example of a symmetrization argu- 
ment, in which we first introduce an independent copy X’, and then symmetrize the problem 
with a Rademacher variable. Such symmetrization arguments are useful in a variety of con- 
texts, as will be seen in later chapters. 
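The two parameter values appearing in Example 2.4 can be compared on a concrete family. The sketch below uses centered Bernoulli variables, which are zero-mean and supported on an interval of length $b - a = 1$; the choice of this family, the grids over $p$ and $\lambda$, and the function names are assumptions made for illustration only.

```python
import math


def centered_bernoulli_mgf(lam: float, p: float) -> float:
    """E[exp(lam * (X - p))] for X ~ Bernoulli(p); X - p is zero-mean and
    supported on the interval [-p, 1 - p], which has length b - a = 1."""
    return (1.0 - p) * math.exp(-lam * p) + p * math.exp(lam * (1.0 - p))


def worst_ratio(sigma: float) -> float:
    """Largest value of MGF / exp(sigma^2 lam^2 / 2) over a grid of (p, lam).
    A value of at most 1 is consistent with sub-Gaussian parameter sigma."""
    ps = [i / 20 for i in range(1, 20)]
    lams = [k / 10 for k in range(-80, 81)]
    return max(
        centered_bernoulli_mgf(lam, p) / math.exp(0.5 * (sigma * lam) ** 2)
        for p in ps
        for lam in lams
    )


crude = worst_ratio(1.0)      # sigma = b - a, as derived in Example 2.4
sharp = worst_ratio(0.5)      # sigma = (b - a) / 2, the sharpened value
too_small = worst_ratio(0.4)  # below (b - a) / 2: condition (2.8) fails

print(f"sigma = 1.0: {crude:.6f}, sigma = 0.5: {sharp:.6f}, sigma = 0.4: {too_small:.6f}")
```

On this grid the ratio never exceeds one (up to rounding) for $\sigma = 1$ and $\sigma = 1/2$, whereas it does for $\sigma = 0.4$, which is smaller than the standard deviation $1/2$ of a symmetric Bernoulli variable.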


Just as the property of Gaussianity is preserved by linear operations, so is the property
of sub-Gaussianity. For instance, if $X_1$ and $X_2$ are independent sub-Gaussian variables with
parameters $\sigma_1$ and $\sigma_2$, then $X_1 + X_2$ is sub-Gaussian with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$. See
Exercise 2.13 for verification of this fact, as well as some related properties. As a consequence
of this fact and the basic sub-Gaussian tail bound (2.7), we obtain an important result,
applicable to sums of independent sub-Gaussian random variables, and known as the Hoeffding
bound:


Proposition 2.5 (Hoeffding bound) Suppose that the variables $X_i$, $i = 1, \ldots, n$, are
independent, and $X_i$ has mean $\mu_i$ and sub-Gaussian parameter $\sigma_i$. Then for all $t \geq 0$,
we have

$$\mathbb{P}\Big[\sum_{i=1}^{n} (X_i - \mu_i) \geq t\Big] \leq \exp\Big\{-\frac{t^2}{2 \sum_{i=1}^{n} \sigma_i^2}\Big\}. \tag{2.10}$$


The Hoeffding bound is often stated only for the special case of bounded random variables.
In particular, if $X_i \in [a, b]$ for all $i = 1, 2, \ldots, n$, then from the result of Exercise 2.4, it is
sub-Gaussian with parameter $\sigma = \frac{b - a}{2}$, so that we obtain the bound

$$\mathbb{P}\Big[\sum_{i=1}^{n} (X_i - \mu_i) \geq t\Big] \leq e^{-\frac{2t^2}{n(b - a)^2}}. \tag{2.11}$$

Although the Hoeffding bound is often stated in this form, the basic idea applies somewhat
more generally to sub-Gaussian variables, as we have given here.
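For bounded variables, the bound (2.11) can be compared with an exact tail probability. The sketch below takes the $X_i$ to be independent Bernoulli variables with parameter $p$, so that $[a, b] = [0, 1]$ and the sum is binomial; this choice, the values of $n$ and $t$, and the function names are our own illustrative assumptions.

```python
import math


def binomial_upper_tail(n: int, p: float, k: int) -> float:
    """Exact P[S >= k] for S ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k, n + 1)
    )


def hoeffding_bound(n: int, t: float, a: float = 0.0, b: float = 1.0) -> float:
    """Right-hand side of (2.11): exp(-2 t^2 / (n (b - a)^2))."""
    return math.exp(-2.0 * t ** 2 / (n * (b - a) ** 2))


n, p = 100, 0.5
results = {}
for t in (5, 10, 20):
    # The event sum_i (X_i - p) >= t is the event S >= n p + t.
    k = math.ceil(n * p + t)
    results[t] = (binomial_upper_tail(n, p, k), hoeffding_bound(n, t))
    print(f"t = {t:2d}: exact = {results[t][0]:.4e}, Hoeffding = {results[t][1]:.4e}")
```

The exact tail probabilities lie below the Hoeffding bound in each case; the gap reflects the fact that the bound uses only the range of the variables and not their full distribution.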


We conclude our discussion of sub-Gaussianity with a result that provides three different 
characterizations of sub-Gaussian variables. First, the most direct way in which to establish 
sub-Gaussianity is by computing or bounding the moment generating function, as we have 
done in Example 2.1. A second intuition is that any sub-Gaussian variable is dominated in a 
certain sense by a Gaussian variable. Third, sub-Gaussianity also follows by having suitably 
tight control on the moments of the random variable. The following result shows that all 
three notions are equivalent in a precise sense. 


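Theorem 2.6 (Equivalent characterizations of sub-Gaussian variables) Given any
zero-mean random variable $X$, the following properties are equivalent:

(I) There is a constant $\sigma \ge 0$ such that

$$
\mathbb{E}[e^{\lambda X}] \le e^{\frac{\lambda^2 \sigma^2}{2}} \quad \text{for all } \lambda \in \mathbb{R}. \qquad (2.12a)
$$

(II) There is a constant $c \ge 0$ and Gaussian random variable $Z \sim N(0, \tau^2)$ such that

$$
\mathbb{P}[|X| \ge s] \le c\, \mathbb{P}[|Z| \ge s] \quad \text{for all } s \ge 0. \qquad (2.12b)
$$

(III) There is a constant $\theta \ge 0$ such that

$$
\mathbb{E}[X^{2k}] \le \frac{(2k)!}{2^k k!}\, \theta^{2k} \quad \text{for all } k = 1, 2, \ldots. \qquad (2.12c)
$$

(IV) There is a constant $\sigma \ge 0$ such that

$$
\mathbb{E}\big[e^{\frac{\lambda X^2}{2\sigma^2}}\big] \le \frac{1}{\sqrt{1 - \lambda}} \quad \text{for all } \lambda \in [0, 1). \qquad (2.12d)
$$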


See Appendix A (Section 2.4) for the proof of these equivalences. 


2.1.3 Sub-exponential variables and Bernstein bounds 


The notion of sub-Gaussianity is fairly restrictive, so that it is natural to consider various
relaxations of it. Accordingly, we now turn to the class of sub-exponential variables, which
are defined by a slightly milder condition on the moment generating function:

Definition 2.7 A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-exponential if there
are non-negative parameters $(\nu, \alpha)$ such that

$$
\mathbb{E}[e^{\lambda(X - \mu)}] \le e^{\frac{\nu^2 \lambda^2}{2}} \quad \text{for all } |\lambda| < \frac{1}{\alpha}. \qquad (2.13)
$$


It follows immediately from this definition that any sub-Gaussian variable is also
sub-exponential; in particular, with $\nu = \sigma$ and $\alpha = 0$, where we interpret $1/0$ as
being the same as $+\infty$. However, the converse statement is not true, as shown by the
following calculation:


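Example 2.8 (Sub-exponential but not sub-Gaussian) Let $Z \sim N(0, 1)$, and consider the
random variable $X = Z^2$. For $\lambda < \frac{1}{2}$, we have

$$
\mathbb{E}[e^{\lambda(X - 1)}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{\lambda(z^2 - 1)} e^{-z^2/2}\, dz
= \frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}}.
$$

For $\lambda > \frac{1}{2}$, the moment generating function is infinite, which reveals that $X$ is
not sub-Gaussian.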

As will be seen momentarily, the existence of the moment generating function in a neighborhood
of zero is actually an equivalent definition of a sub-exponential variable. Let us
verify directly that condition (2.13) is satisfied. Following some calculus, we find that

$$
\frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \le e^{2\lambda^2} = e^{4\lambda^2/2} \quad \text{for all } |\lambda| < \frac{1}{4}, \qquad (2.14)
$$

which shows that $X$ is sub-exponential with parameters $(\nu, \alpha) = (2, 4)$. ♣
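The inequality (2.14) can be sanity-checked numerically. The sketch below evaluates both sides
on a grid of points with $|\lambda| < 1/4$; the grid spacing and the function names are
illustrative assumptions, and a grid check is of course not a proof.

```python
import math

def centered_chi2_mgf(lam):
    """E[exp(lam * (Z^2 - 1))] for Z ~ N(0, 1); finite only for lam < 1/2."""
    if lam >= 0.5:
        return math.inf
    return math.exp(-lam) / math.sqrt(1.0 - 2.0 * lam)

def subexp_envelope(lam, nu=2.0):
    """Right-hand side exp(nu^2 lam^2 / 2) of condition (2.13)."""
    return math.exp(nu**2 * lam**2 / 2.0)

# Grid of points with |lam| < 1/4 (spacing 1/1000 is an arbitrary choice).
grid = [k / 1000.0 for k in range(-249, 250)]
worst_ratio = max(centered_chi2_mgf(lam) / subexp_envelope(lam) for lam in grid)
print(worst_ratio)  # a value of at most 1 is consistent with (2.14)
```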


As with sub-Gaussianity, the control (2.13) on the moment generating function, when
combined with the Chernoff technique, yields deviation and concentration inequalities for
sub-exponential variables. When $t$ is small enough, these bounds are sub-Gaussian in nature
(i.e., with the exponent quadratic in $t$), whereas for larger $t$, the exponential component of
the bound scales linearly in $t$. We summarize in the following:


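Proposition 2.9 (Sub-exponential tail bound) Suppose that $X$ is sub-exponential with
parameters $(\nu, \alpha)$. Then

$$
\mathbb{P}[X - \mu \ge t] \le
\begin{cases}
e^{-\frac{t^2}{2\nu^2}} & \text{if } 0 \le t \le \frac{\nu^2}{\alpha}, \\
e^{-\frac{t}{2\alpha}} & \text{for } t > \frac{\nu^2}{\alpha}.
\end{cases}
$$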


As with the Hoeffding inequality, similar bounds can be derived for the left-sided event
$\{X - \mu \le -t\}$, as well as the two-sided event $\{|X - \mu| \ge t\}$, with an additional
factor of 2 in the latter case.

Proof By recentering as needed, we may assume without loss of generality that $\mu = 0$.
We follow the usual Chernoff-type approach: combining it with the definition (2.13) of a
sub-exponential variable yields the upper bound

$$
\mathbb{P}[X \ge t] \le e^{-\lambda t}\, \mathbb{E}[e^{\lambda X}]
\le \exp\Big(\underbrace{-\lambda t + \frac{\lambda^2 \nu^2}{2}}_{g(\lambda, t)}\Big),
\quad \text{valid for all } \lambda \in [0, \alpha^{-1}).
$$

In order to complete the proof, it remains to compute, for each fixed $t \ge 0$, the quantity
$g^*(t) := \inf_{\lambda \in [0, \alpha^{-1})} g(\lambda, t)$. Note that the unconstrained minimum
of the function $g(\cdot, t)$ occurs at $\lambda^* = t/\nu^2$. If $0 \le t < \frac{\nu^2}{\alpha}$,
then this unconstrained optimum corresponds to the constrained minimum as well, so that
$g^*(t) = -\frac{t^2}{2\nu^2}$ over this interval.

Otherwise, we may assume that $t \ge \frac{\nu^2}{\alpha}$. In this case, since the function
$g(\cdot, t)$ is monotonically decreasing in the interval $[0, \lambda^*)$, the constrained
minimum is achieved at the boundary point $\lambda^\dagger = \alpha^{-1}$, and we have

$$
g^*(t) = g(\lambda^\dagger, t) = -\frac{t}{\alpha} + \frac{1}{2\alpha} \frac{\nu^2}{\alpha}
\overset{(i)}{\le} -\frac{t}{2\alpha},
$$

where inequality (i) uses the fact that $\frac{\nu^2}{\alpha} \le t$.
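The two regimes in Proposition 2.9 can be seen by minimizing $g(\lambda, t)$ numerically. The
sketch below is one way to do this under stated assumptions: the grid search over
$[0, \alpha^{-1})$, the function names and the parameter values $(\nu, \alpha) = (2, 4)$ from
Example 2.8 are illustrative choices.

```python
import math

def subexp_tail_bound(t, nu, alpha):
    """Proposition 2.9: bound on P[X - mu >= t] for a (nu, alpha)-sub-exponential X."""
    if alpha == 0 or t <= nu**2 / alpha:
        return math.exp(-t**2 / (2.0 * nu**2))   # sub-Gaussian regime
    return math.exp(-t / (2.0 * alpha))          # linear-exponent regime

def chernoff_on_grid(t, nu, alpha, steps=10_000):
    """Minimize g(lam, t) = -lam*t + lam^2 nu^2 / 2 over a grid in [0, 1/alpha); needs alpha > 0."""
    lam_max = 1.0 / alpha
    best = min(
        -lam * t + lam**2 * nu**2 / 2.0
        for lam in (lam_max * j / steps for j in range(steps))
    )
    return math.exp(best)

nu, alpha = 2.0, 4.0   # threshold nu^2 / alpha = 1
for t in (0.5, 3.0):
    print(t, chernoff_on_grid(t, nu, alpha), subexp_tail_bound(t, nu, alpha))
```

For $t$ below the threshold the two numbers should agree, since the unconstrained minimizer is
feasible. Above the threshold the grid value can be smaller than the stated bound, because
step (i) of the proof replaces the boundary value $g(\alpha^{-1}, t)$ by the simpler
$-t/(2\alpha)$.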


As shown in Example 2.8, the sub-exponential property can be verified by explicitly com- 
puting or bounding the moment generating function. This direct calculation may be imprac- 
ticable in many settings, so it is natural to seek alternative approaches. One such method is 
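based on control of the polynomial moments of $X$. Given a random variable $X$ with mean
$\mu = \mathbb{E}[X]$ and variance $\sigma^2 = \mathbb{E}[X^2] - \mu^2$, we say that Bernstein's
condition with parameter $b$ holds if

$$
\big|\mathbb{E}[(X - \mu)^k]\big| \le \tfrac{1}{2}\, k!\, \sigma^2 b^{k-2} \quad \text{for } k = 2, 3, 4, \ldots. \qquad (2.15)
$$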


One sufficient condition for Bernstein's condition to hold is that $X$ be bounded; in particular,
if $|X - \mu| \le b$, then it is straightforward to verify that condition (2.15) holds. Even
for bounded variables, our next result will show that the Bernstein condition can be used 
to obtain tail bounds that may be tighter than the Hoeffding bound. Moreover, Bernstein’s 
condition is also satisfied by various unbounded variables, a property which lends it much 
broader applicability. 


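When $X$ satisfies the Bernstein condition, then it is sub-exponential with parameters
determined by $\sigma^2$ and $b$. Indeed, by the power-series expansion of the exponential, we have

$$
\mathbb{E}[e^{\lambda(X - \mu)}] = 1 + \frac{\lambda^2 \sigma^2}{2} + \sum_{k=3}^{\infty} \lambda^k\, \frac{\mathbb{E}[(X - \mu)^k]}{k!}
\overset{(i)}{\le} 1 + \frac{\lambda^2 \sigma^2}{2} + \frac{\lambda^2 \sigma^2}{2} \sum_{k=3}^{\infty} (|\lambda|\, b)^{k-2},
$$

where the inequality (i) makes use of the Bernstein condition (2.15). For any $|\lambda| < 1/b$, we
can sum the geometric series so as to obtain

$$
\mathbb{E}[e^{\lambda(X - \mu)}] \le 1 + \frac{\lambda^2 \sigma^2 / 2}{1 - b|\lambda|}
\overset{(ii)}{\le} \exp\Big\{\frac{\lambda^2 \sigma^2 / 2}{1 - b|\lambda|}\Big\}, \qquad (2.16)
$$

where inequality (ii) follows from the bound $1 + t \le e^t$. Consequently, we conclude that

$$
\mathbb{E}[e^{\lambda(X - \mu)}] \le e^{\frac{\lambda^2 (\sqrt{2}\sigma)^2}{2}} \quad \text{for all } |\lambda| < \frac{1}{2b},
$$

showing that $X$ is sub-exponential with parameters $(\sqrt{2}\sigma, 2b)$.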


As a consequence, an application of Proposition 2.9 leads directly to tail bounds on a 
random variable satisfying the Bernstein condition (2.15). However, the resulting tail bound 
can be sharpened slightly, at least in terms of constant factors, by making direct use of the 
upper bound (2.16). We summarize in the following: 


Proposition 2.10 (Bernstein-type bound) For any random variable satisfying the
Bernstein condition (2.15), we have

$$
\mathbb{E}[e^{\lambda(X - \mu)}] \le e^{\frac{\lambda^2 \sigma^2 / 2}{1 - b|\lambda|}} \quad \text{for all } |\lambda| < \frac{1}{b}, \qquad (2.17a)
$$

and, moreover, the concentration inequality

$$
\mathbb{P}[|X - \mu| \ge t] \le 2 e^{-\frac{t^2}{2(\sigma^2 + bt)}} \quad \text{for all } t \ge 0. \qquad (2.17b)
$$

We proved inequality (2.17a) in the discussion preceding this proposition. Using this
bound on the moment generating function, the tail bound (2.17b) follows by setting
$\lambda = \frac{t}{bt + \sigma^2} \in [0, \frac{1}{b})$ in the Chernoff bound, and then
simplifying the resulting expression.
Remark: Proposition 2.10 has an important consequence even for bounded random variables
(i.e., those satisfying $|X - \mu| \le b$). The most straightforward way to control such
variables is by exploiting the boundedness to show that $(X - \mu)$ is sub-Gaussian with
parameter $b$ (see Exercise 2.4), and then applying a Hoeffding-type inequality (see
Proposition 2.5). Alternatively, using the fact that any bounded variable satisfies the Bernstein
condition (2.15), we can also apply Proposition 2.10, thereby obtaining the tail bound (2.17b),
which involves both the variance $\sigma^2$ and the bound $b$. This tail bound shows that for
suitably small $t$, the variable $X$ has sub-Gaussian behavior with parameter $\sigma$, as opposed
to the parameter $b$ that would arise from a Hoeffding approach. Since
$\sigma^2 = \mathbb{E}[(X - \mu)^2] \le b^2$, this bound is never worse; moreover, it is
substantially better when $\sigma^2 \ll b^2$, as would be the case for a random variable that
occasionally takes on large values, but has relatively small variance. Such variance-based control
frequently plays a key role in obtaining optimal rates in statistical problems, as will be seen in
later chapters. For bounded random variables, Bennett's inequality can be used to provide sharper
control on the tails (see Exercise 2.7).
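To make the comparison in this remark concrete, the sketch below evaluates a Hoeffding-type and a
Bernstein-type bound for a sum of $n$ i.i.d. variables with $|X_i - \mu| \le b$ and variance
$\sigma^2$. The sum versions used here are an assumption of this sketch: they follow by
multiplying the moment generating function bounds across independent terms, which replaces
$b^2$ by $n b^2$ in the Hoeffding exponent and $\sigma^2$ by $n\sigma^2$ in (2.17b). The
numerical values are hypothetical.

```python
import math

def hoeffding_two_sided(t, n, b):
    """2 exp(-t^2 / (2 n b^2)): sum of n terms, each sub-Gaussian with parameter b."""
    return 2.0 * math.exp(-t**2 / (2.0 * n * b**2))

def bernstein_two_sided(t, n, sigma2, b):
    """2 exp(-t^2 / (2 (n sigma^2 + b t))): form (2.17b) with sigma^2 replaced by n sigma^2."""
    return 2.0 * math.exp(-t**2 / (2.0 * (n * sigma2 + b * t)))

n, b, sigma2 = 1000, 1.0, 0.01   # small variance relative to the range
for t in (10.0, 30.0, 100.0):
    print(t, hoeffding_two_sided(t, n, b), bernstein_two_sided(t, n, sigma2, b))
```

With these values the Bernstein-type bound is smaller by many orders of magnitude at moderate
$t$. Comparing exponents shows that it is the smaller of the two whenever
$n\sigma^2 + bt \le n b^2$.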


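Like the sub-Gaussian property, the sub-exponential property is preserved under summation
for independent random variables, and the parameters $(\nu, \alpha)$ transform in a simple
way. In particular, consider an independent sequence $\{X_k\}_{k=1}^n$ of random variables, such
that $X_k$ has mean $\mu_k$, and is sub-exponential with parameters $(\nu_k, \alpha_k)$. We compute
the moment generating function

$$
\mathbb{E}\big[e^{\lambda \sum_{k=1}^n (X_k - \mu_k)}\big]
\overset{(i)}{=} \prod_{k=1}^n \mathbb{E}\big[e^{\lambda (X_k - \mu_k)}\big]
\overset{(ii)}{\le} \prod_{k=1}^n e^{\lambda^2 \nu_k^2 / 2},
$$

valid for all $|\lambda| < \big(\max_{k=1,\ldots,n} \alpha_k\big)^{-1}$, where equality (i) follows
from independence, and inequality (ii) follows since $X_k$ is sub-exponential with parameters
$(\nu_k, \alpha_k)$. Thus, we conclude that the variable $\sum_{k=1}^n (X_k - \mu_k)$ is
sub-exponential with the parameters $(\nu_*, \alpha_*)$, where

$$
\alpha_* := \max_{k=1,\ldots,n} \alpha_k \quad \text{and} \quad \nu_* := \sqrt{\sum_{k=1}^n \nu_k^2}.
$$

Using the same argument as in Proposition 2.9, this observation leads directly to the upper
tail bound

$$
\mathbb{P}\Big[\frac{1}{n} \sum_{k=1}^n (X_k - \mu_k) \ge t\Big] \le
\begin{cases}
e^{-\frac{n t^2}{2(\nu_*^2 / n)}} & \text{for } 0 \le t \le \frac{\nu_*^2}{n \alpha_*}, \\
e^{-\frac{n t}{2 \alpha_*}} & \text{for } t > \frac{\nu_*^2}{n \alpha_*},
\end{cases}
\qquad (2.18)
$$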

along with similar two-sided tail bounds. Let us illustrate our development thus far with 
some examples. 


Example 2.11 ($\chi^2$-variables) A chi-squared ($\chi^2$) random variable with $n$ degrees of
freedom, denoted by $Y \sim \chi_n^2$, can be represented as the sum $Y = \sum_{k=1}^n Z_k^2$,
where $Z_k \sim N(0, 1)$ are i.i.d. variates. As discussed in Example 2.8, the variable $Z_k^2$ is
sub-exponential with parameters $(2, 4)$. Consequently, since the variables $\{Z_k\}_{k=1}^n$ are
independent, the $\chi^2$-variate $Y$ is sub-exponential with parameters
$(\nu, \alpha) = (2\sqrt{n}, 4)$, and the preceding discussion yields the two-sided tail bound

$$
\mathbb{P}\Big[\Big|\frac{1}{n} \sum_{k=1}^n Z_k^2 - 1\Big| \ge t\Big] \le 2 e^{-n t^2 / 8} \quad \text{for all } t \in (0, 1). \qquad (2.19)
$$

♣

The concentration of $\chi^2$-variables plays an important role in the analysis of procedures
based on taking random projections. A classical instance of the random projection method is the
Johnson–Lindenstrauss analysis of metric embedding.


Example 2.12 (Johnson–Lindenstrauss embedding) As one application of the concentration
of $\chi^2$-variables, consider the following problem. Suppose that we are given $N \ge 2$
distinct vectors $\{u^1, \ldots, u^N\}$, with each vector lying in $\mathbb{R}^d$. If the data
dimension $d$ is large, then it might be expensive to store and manipulate the data set. The idea
of dimensionality reduction is to construct a mapping $F: \mathbb{R}^d \to \mathbb{R}^m$, with the
projected dimension $m$ substantially smaller than $d$, that preserves some "essential" features
of the data set. What features should we try to preserve? There is not a unique answer to this
question but, as one interesting example, we might consider preserving pairwise distances, or
equivalently norms and inner products. Many algorithms are based on such pairwise quantities,
including linear regression, methods for principal components, the k-means algorithm for
clustering, and nearest-neighbor algorithms for density estimation. With these motivations in
mind, given some tolerance $\delta \in (0, 1)$, we might be interested in a mapping $F$ with the
guarantee that

$$
(1 - \delta) \le \frac{\|F(u^i) - F(u^j)\|_2^2}{\|u^i - u^j\|_2^2} \le (1 + \delta) \quad \text{for all pairs } u^i \ne u^j. \qquad (2.20)
$$

In words, the projected data set $\{F(u^1), \ldots, F(u^N)\}$ preserves all pairwise squared
distances up to a multiplicative factor of $\delta$. Of course, this is always possible if the
projected dimension $m$ is large enough, but the goal is to do it with relatively small $m$.

Constructing such a mapping that satisfies the condition (2.20) with high probability
turns out to be straightforward as long as the projected dimension is lower bounded as
$m \gtrsim \frac{1}{\delta^2} \log N$. Observe that the projected dimension is independent of the
ambient dimension $d$, and scales only logarithmically with the number of data points $N$.

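The construction is probabilistic: first form a random matrix $\mathbf{X} \in \mathbb{R}^{m \times d}$
filled with independent $N(0, 1)$ entries, and use it to define a linear mapping
$F: \mathbb{R}^d \to \mathbb{R}^m$ via $u \mapsto \mathbf{X}u / \sqrt{m}$. We now verify that $F$
satisfies condition (2.20) with high probability. Let $x_i \in \mathbb{R}^d$ denote the $i$th row
of $\mathbf{X}$, and consider some fixed $u \ne 0$. Since $x_i$ is a standard normal vector, the
variable $\langle x_i, u / \|u\|_2 \rangle$ follows a $N(0, 1)$ distribution, and hence the quantity

$$
Y := \frac{\|\mathbf{X}u\|_2^2}{\|u\|_2^2} = \sum_{i=1}^m \langle x_i, u / \|u\|_2 \rangle^2
$$

follows a $\chi^2$ distribution with $m$ degrees of freedom, using the independence of the rows.
Therefore, applying the tail bound (2.19), we find that

$$
\mathbb{P}\Big[\Big|\frac{\|\mathbf{X}u\|_2^2}{m \|u\|_2^2} - 1\Big| \ge \delta\Big] \le 2 e^{-m \delta^2 / 8} \quad \text{for all } \delta \in (0, 1).
$$

Rearranging and recalling the definition of $F$ yields the bound

$$
\mathbb{P}\Big[\frac{\|F(u)\|_2^2}{\|u\|_2^2} \notin [(1 - \delta), (1 + \delta)]\Big] \le 2 e^{-m \delta^2 / 8}, \quad \text{for any fixed } 0 \ne u \in \mathbb{R}^d.
$$

Noting that there are $\binom{N}{2}$ distinct pairs of data points, we apply the union bound to
conclude that

$$
\mathbb{P}\Big[\frac{\|F(u^i - u^j)\|_2^2}{\|u^i - u^j\|_2^2} \notin [(1 - \delta), (1 + \delta)] \text{ for some } u^i \ne u^j\Big]
\le 2 \binom{N}{2} e^{-m \delta^2 / 8}.
$$

For any $\epsilon \in (0, 1)$, this probability can be driven below $\epsilon$ by choosing
$m > \frac{16}{\delta^2} \log(N / \epsilon)$. ♣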
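The construction in Example 2.12 is short enough to try directly. The following sketch uses only
the Python standard library; the data set (uniform random points), the values of $N$, $d$,
$\delta$, $\epsilon$, the seed and all function names are hypothetical choices made for
illustration.

```python
import math
import random

def jl_dimension(num_points, delta, eps):
    """Smallest integer m with m > (16 / delta^2) * log(N / eps)."""
    return math.floor(16.0 / delta**2 * math.log(num_points / eps)) + 1

def random_projection(points, m, rng):
    """Apply F(u) = X u / sqrt(m), where X is m-by-d with i.i.d. N(0, 1) entries."""
    d = len(points[0])
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    scale = 1.0 / math.sqrt(m)
    return [[scale * sum(x * u for x, u in zip(row, p)) for row in X] for p in points]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def distortion_ratios(points, projected):
    """Ratios ||F(u^i) - F(u^j)||^2 / ||u^i - u^j||^2 over all pairs i < j."""
    n = len(points)
    return [
        sq_dist(projected[i], projected[j]) / sq_dist(points[i], points[j])
        for i in range(n) for j in range(i + 1, n)
    ]

rng = random.Random(0)                      # fixed seed for reproducibility
N, d, delta, eps = 5, 1000, 0.5, 0.01       # hypothetical problem sizes
points = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(N)]
m = jl_dimension(N, delta, eps)
ratios = distortion_ratios(points, random_projection(points, m, rng))
print(m, min(ratios), max(ratios))
```

Because $F$ is linear, $F(u^i) - F(u^j) = F(u^i - u^j)$, so the printed ratios are exactly the
quantities controlled by the union bound. With these settings the guarantee says that all ratios
lie in $[1 - \delta, 1 + \delta]$ with probability at least $1 - \epsilon$.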


In parallel to Theorem 2.6, there are a number of equivalent ways to characterize a
sub-exponential random variable. The following theorem provides a summary:

Theorem 2.13 (Equivalent characterizations of sub-exponential variables) For a zero-mean
random variable $X$, the following statements are equivalent:

(I) There are non-negative numbers $(\nu, \alpha)$ such that

$$
\mathbb{E}[e^{\lambda X}] \le e^{\frac{\nu^2 \lambda^2}{2}} \quad \text{for all } |\lambda| < \frac{1}{\alpha}. \qquad (2.21a)
$$

(II) There is a positive number $c_0 > 0$ such that $\mathbb{E}[e^{\lambda X}] < \infty$ for all
$|\lambda| \le c_0$.

(III) There are constants $c_1, c_2 > 0$ such that

$$
\mathbb{P}[|X| \ge t] \le c_1 e^{-c_2 t} \quad \text{for all } t > 0. \qquad (2.21b)
$$

(IV) The quantity $\gamma := \sup_{k \ge 2} \Big[\frac{\mathbb{E}[X^k]}{k!}\Big]^{1/k}$ is finite.


See Appendix B (Section 2.5) for the proof of this claim. 


2.1.4 Some one-sided results 


Up to this point, we have focused on two-sided forms of Bernstein’s condition, which yields 
bounds on both the upper and lower tails. As we have seen, one sufficient condition for 
Bernstein’s condition to hold is a bound on the absolute value, say |X| < b almost surely. Of 
course, if such a bound only holds in a one-sided way, it is still possible to derive one-sided 
bounds. In this section, we state and prove one such result. 


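Proposition 2.14 (One-sided Bernstein's inequality) If $X \le b$ almost surely, then

$$
\mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}] \le \exp\Big(\frac{\frac{\lambda^2}{2}\, \mathbb{E}[X^2]}{1 - \frac{b\lambda}{3}}\Big) \quad \text{for all } \lambda \in [0, 3/b). \qquad (2.22a)
$$

Consequently, given $n$ independent random variables such that $X_i \le b$ almost surely,
we have

$$
\mathbb{P}\Big[\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge n\delta\Big]
\le \exp\Big(-\frac{n \delta^2}{2\big(\frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i^2] + \frac{b\delta}{3}\big)}\Big). \qquad (2.22b)
$$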

Of course, if a random variable is bounded from below, then the same result can be used
to derive bounds on its lower tail; we simply apply the bound (2.22b) to the random variable
$-X$. In the special case of independent non-negative random variables $Y_i \ge 0$, we find that

$$
\mathbb{P}\Big[\sum_{i=1}^n (Y_i - \mathbb{E}[Y_i]) \le -n\delta\Big]
\le \exp\Big(-\frac{n \delta^2}{\frac{2}{n} \sum_{i=1}^n \mathbb{E}[Y_i^2]}\Big). \qquad (2.23)
$$

Thus, we see that the lower tail of any non-negative random variable satisfies a bound of the
sub-Gaussian type, albeit with the second moment instead of the variance.

The proof of Proposition 2.14 is quite straightforward given our development thus far.


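Proof Defining the function

$$
h(u) := 2\, \frac{e^u - u - 1}{u^2} = 2 \sum_{k=2}^{\infty} \frac{u^{k-2}}{k!},
$$

we have the expansion

$$
\mathbb{E}[e^{\lambda X}] = 1 + \lambda \mathbb{E}[X] + \tfrac{1}{2} \lambda^2\, \mathbb{E}[X^2 h(\lambda X)].
$$

Observe that for all scalars $x < 0$, $x' \in [0, b]$ and $\lambda > 0$, we have

$$
h(\lambda x) \le h(0) \le h(\lambda x') \le h(\lambda b).
$$

Consequently, since $X \le b$ almost surely, we have
$\mathbb{E}[X^2 h(\lambda X)] \le \mathbb{E}[X^2]\, h(\lambda b)$, and hence

$$
\mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}]
\le e^{-\lambda \mathbb{E}[X]} \big\{1 + \lambda \mathbb{E}[X] + \tfrac{1}{2} \lambda^2\, \mathbb{E}[X^2]\, h(\lambda b)\big\}
\le \exp\Big\{\frac{\lambda^2\, \mathbb{E}[X^2]}{2}\, h(\lambda b)\Big\}.
$$

Consequently, the bound (2.22a) will follow if we can show that
$h(\lambda b) \le \big(1 - \frac{\lambda b}{3}\big)^{-1}$ for $\lambda b < 3$. By applying the
inequality $k! \ge 2 (3^{k-2})$, valid for all $k \ge 2$, we find that

$$
h(\lambda b) = 2 \sum_{k=2}^{\infty} \frac{(\lambda b)^{k-2}}{k!}
\le \sum_{k=2}^{\infty} \Big(\frac{\lambda b}{3}\Big)^{k-2} = \frac{1}{1 - \frac{\lambda b}{3}},
$$

where the condition $\frac{\lambda b}{3} \in [0, 1)$ allows us to sum the geometric series.

In order to prove the upper tail bound (2.22b), we apply the Chernoff bound, exploiting
independence to apply the moment generating function bound (2.22a) separately, and thereby
find that

$$
\mathbb{P}\Big[\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge n\delta\Big]
\le \exp\Big(-\lambda n \delta + \frac{\frac{\lambda^2}{2} \sum_{i=1}^n \mathbb{E}[X_i^2]}{1 - \frac{b\lambda}{3}}\Big), \quad \text{valid for } b\lambda \in [0, 3).
$$

Substituting

$$
\lambda = \frac{n\delta}{\sum_{i=1}^n \mathbb{E}[X_i^2] + \frac{n \delta b}{3}} \in [0, 3/b)
$$

and simplifying yields the bound.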
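For readers who want the final simplification spelled out, here is one way to carry it out. The
shorthand $S := \sum_{i=1}^n \mathbb{E}[X_i^2]$ is introduced only for this sketch and does not
appear in the text.

```latex
% Shorthand (assumption of this sketch): S := \sum_{i=1}^n E[X_i^2],
% and \lambda = n\delta / (S + n\delta b/3) as in the proof.
\begin{align*}
1 - \frac{b\lambda}{3}
  &= \frac{S}{S + \frac{n\delta b}{3}}, \\
\frac{\frac{\lambda^2}{2}\, S}{1 - \frac{b\lambda}{3}}
  &= \frac{\lambda^2}{2} \Big(S + \frac{n\delta b}{3}\Big)
   = \frac{\lambda\, n\delta}{2}, \\
-\lambda n\delta + \frac{\lambda\, n\delta}{2}
  &= -\frac{\lambda\, n\delta}{2}
   = -\frac{n^2\delta^2}{2\big(S + \frac{n\delta b}{3}\big)}
   = -\frac{n\delta^2}{2\big(\frac{1}{n} S + \frac{b\delta}{3}\big)}.
\end{align*}
```

The last expression is the exponent appearing in (2.22b).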


2.2 Martingale-based methods 


Up until this point, our techniques have provided various types of bounds on sums of in- 
dependent random variables. Many problems require bounds on more general functions of 
random variables, and one classical approach is based on martingale decompositions. In 
this section, we describe some of the results in this area along with some examples. Our 
treatment is quite brief, so we refer the reader to the bibliographic section for additional 
references.


2.2.1 Background


Let us begin by introducing a particular case of a martingale sequence that is especially
relevant for obtaining tail bounds. Let $\{X_k\}_{k=1}^n$ be a sequence of independent random
variables, and consider the random variable $f(X) = f(X_1, \ldots, X_n)$, for some function
$f: \mathbb{R}^n \to \mathbb{R}$. Suppose that our goal is to obtain bounds on the deviations of
$f$ from its mean. In order to do so, we consider the sequence of random variables given by
$Y_0 = \mathbb{E}[f(X)]$, $Y_n = f(X)$, and

$$
Y_k = \mathbb{E}[f(X) \mid X_1, \ldots, X_k] \quad \text{for } k = 1, \ldots, n-1, \qquad (2.24)
$$

where we assume that all conditional expectations exist. Note that $Y_0$ is a constant, and the
random variables $Y_k$ will tend to exhibit more fluctuations as we move along the sequence
from $Y_0$ to $Y_n$. Based on this intuition, the martingale approach to tail bounds is based on
the telescoping decomposition

$$
f(X) - \mathbb{E}[f(X)] = Y_n - Y_0 = \sum_{k=1}^n \underbrace{(Y_k - Y_{k-1})}_{D_k},
$$

in which the deviation $f(X) - \mathbb{E}[f(X)]$ is written as a sum of increments
$\{D_k\}_{k=1}^n$. As we will see, the sequence $\{Y_k\}_{k=1}^n$ is a particular example of a
martingale sequence, known as the Doob martingale, whereas the sequence $\{D_k\}_{k=1}^n$ is an
example of a martingale difference sequence.
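For a small discrete example, the sequence $Y_0, \ldots, Y_n$ and the increments $D_k$ can be
computed exactly by enumeration. The sketch below assumes i.i.d. fair coin flips and takes $f$ to
be the length of the longest run of ones; both choices, along with the function names and the
sample path, are hypothetical and serve only to illustrate the telescoping decomposition.

```python
from itertools import product

def doob_martingale(f, prefix, n, values=(0, 1)):
    """Y_k = E[f(X) | X_1..X_k = prefix] for i.i.d. X_i uniform on `values`."""
    k = len(prefix)
    tails = list(product(values, repeat=n - k))   # all completions of the prefix
    return sum(f(tuple(prefix) + tail) for tail in tails) / len(tails)

def longest_run_of_ones(x):
    """Length of the longest block of consecutive ones in the sequence x."""
    best = current = 0
    for bit in x:
        current = current + 1 if bit == 1 else 0
        best = max(best, current)
    return best

x = (1, 1, 0, 1, 1, 1)   # one realized sample path, n = 6
n = len(x)
Y = [doob_martingale(longest_run_of_ones, x[:k], n) for k in range(n + 1)]
D = [Y[k] - Y[k - 1] for k in range(1, n + 1)]
print(Y)        # Y[0] is the constant E[f(X)], Y[n] equals f(x)
print(sum(D))   # telescopes to Y[n] - Y[0]
```

Averaging `doob_martingale` over the two possible values of the next coordinate recovers the
previous term of the sequence, which is the martingale property in this finite setting.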

With this example in mind, we now turn to the general definition of a martingale sequence.
Let $\{\mathcal{F}_k\}_{k=1}^\infty$ be a sequence of $\sigma$-fields that are nested, meaning that
$\mathcal{F}_k \subseteq \mathcal{F}_{k+1}$ for all $k \ge 1$; such a sequence is known as a
filtration. In the Doob martingale described above, the $\sigma$-field $\sigma(X_1, \ldots, X_k)$
generated by the first $k$ variables plays the role of $\mathcal{F}_k$. Let $\{Y_k\}_{k=1}^\infty$
be a sequence of random variables such that $Y_k$ is measurable with respect to the
$\sigma$-field $\mathcal{F}_k$. In this case, we say that $\{Y_k\}_{k=1}^\infty$ is adapted to the
filtration $\{\mathcal{F}_k\}_{k=1}^\infty$. In the Doob martingale, the random variable $Y_k$ is a
measurable function of $(X_1, \ldots, X_k)$, and hence the sequence is adapted to the filtration
defined by the $\sigma$-fields. We are now ready to define a general martingale:

Definition 2.15 Given a sequence $\{Y_k\}_{k=1}^\infty$ of random variables adapted to a filtration
$\{\mathcal{F}_k\}_{k=1}^\infty$, the pair $\{(Y_k, \mathcal{F}_k)\}_{k=1}^\infty$ is a martingale
if, for all $k \ge 1$,

$$
\mathbb{E}[|Y_k|] < \infty \quad \text{and} \quad \mathbb{E}[Y_{k+1} \mid \mathcal{F}_k] = Y_k. \qquad (2.25)
$$

It is frequently the case that the filtration is defined by a second sequence of random
variables $\{X_k\}_{k=1}^\infty$ via the canonical $\sigma$-fields
$\mathcal{F}_k := \sigma(X_1, \ldots, X_k)$. In this case, we say that $\{Y_k\}_{k=1}^\infty$
is a martingale sequence with respect to $\{X_k\}_{k=1}^\infty$. The Doob construction is an
instance of such a martingale sequence. If a sequence is martingale with respect to itself (i.e.,
with $\mathcal{F}_k = \sigma(Y_1, \ldots, Y_k)$), then we say simply that $\{Y_k\}_{k=1}^\infty$
forms a martingale sequence.


Let us consider some examples to illustrate: 


Example 2.16 (Partial sums as martingales) Perhaps the simplest instance of a martingale
is provided by considering partial sums of an i.i.d. sequence. Let $\{X_k\}_{k=1}^\infty$ be a
sequence of i.i.d. random variables with mean $\mu$, and define the partial sums
$S_k := \sum_{j=1}^k X_j$. Defining $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$, the random variable
$S_k$ is measurable with respect to $\mathcal{F}_k$, and, moreover, we have

$$
\begin{aligned}
\mathbb{E}[S_{k+1} \mid \mathcal{F}_k] &= \mathbb{E}[X_{k+1} + S_k \mid X_1, \ldots, X_k] \\
&= \mathbb{E}[X_{k+1}] + S_k \\
&= \mu + S_k.
\end{aligned}
$$

Here we have used the facts that $X_{k+1}$ is independent of $X_1^k := (X_1, \ldots, X_k)$, and
that $S_k$ is a function of $X_1^k$. Thus, while the sequence $\{S_k\}_{k=1}^\infty$ itself is not
a martingale unless $\mu = 0$, the recentered variables $Y_k := S_k - k\mu$ for $k \ge 1$ define a
martingale sequence with respect to $\{X_k\}_{k=1}^\infty$. ♣


Let us now show that the Doob construction does lead to a martingale, as long as the under- 
lying function f is absolutely integrable. 


Example 2.17 (Doob construction) Given a sequence of independent random variables
$\{X_k\}_{k=1}^n$, recall the sequence $Y_k = \mathbb{E}[f(X) \mid X_1, \ldots, X_k]$ previously
defined, and suppose that $\mathbb{E}[|f(X)|] < \infty$. We claim that $\{Y_k\}_{k=0}^n$ is a
martingale with respect to $\{X_k\}_{k=1}^n$. Indeed, in terms of the shorthand
$X_1^k = (X_1, X_2, \ldots, X_k)$, we have

$$
\mathbb{E}[|Y_k|] = \mathbb{E}\big[\big|\mathbb{E}[f(X) \mid X_1^k]\big|\big] \le \mathbb{E}[|f(X)|] < \infty,
$$

where the bound follows from Jensen's inequality. Turning to the second property, we have

$$
\mathbb{E}[Y_{k+1} \mid X_1^k] = \mathbb{E}\big[\mathbb{E}[f(X) \mid X_1^{k+1}] \mid X_1^k\big]
\overset{(i)}{=} \mathbb{E}[f(X) \mid X_1^k] = Y_k,
$$

where we have used the tower property of conditional expectation in step (i). ♣


The following martingale plays an important role in analyzing stopping rules for sequential 
hypothesis tests: 


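Example 2.18 (Likelihood ratio) Let $f$ and $g$ be two mutually absolutely continuous
densities, and let $\{X_k\}_{k=1}^\infty$ be a sequence of random variables drawn i.i.d. according
to $f$. For each $k \ge 1$, let $Y_k := \prod_{\ell=1}^k \frac{g(X_\ell)}{f(X_\ell)}$ be the
likelihood ratio based on the first $k$ samples. Then the sequence $\{Y_k\}_{k=1}^\infty$ is a
martingale with respect to $\{X_k\}_{k=1}^\infty$. Indeed, we have

$$
\mathbb{E}[Y_{n+1} \mid X_1, \ldots, X_n]
= \mathbb{E}\Big[\frac{g(X_{n+1})}{f(X_{n+1})}\Big] \prod_{k=1}^n \frac{g(X_k)}{f(X_k)} = Y_n,
$$

using the fact that $\mathbb{E}\big[\frac{g(X_{n+1})}{f(X_{n+1})}\big] = 1$. ♣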
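The martingale property of the likelihood ratio can be verified exactly in a discrete analogue.
The sketch below replaces the densities by two probability mass functions on $\{0, 1, 2\}$ with
common support; these particular mass functions, the observed history and the function names are
hypothetical, and exact rational arithmetic is used so that the identity holds without rounding.

```python
from fractions import Fraction

# Two hypothetical pmfs on {0, 1, 2} with the same support.
f = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
g = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}

def likelihood_ratio(xs):
    """Y_k = prod_l g(x_l) / f(x_l) along the observed samples xs."""
    y = Fraction(1)
    for x in xs:
        y *= g[x] / f[x]
    return y

def conditional_mean_next(xs):
    """E[Y_{k+1} | X_1..X_k = xs] when X_{k+1} is drawn from f."""
    return sum(f[x] * likelihood_ratio(list(xs) + [x]) for x in f)

history = [2, 0, 1, 2]
print(likelihood_ratio(history), conditional_mean_next(history))  # prints: 27/8 27/8
```

The two printed values coincide because the expected one-step ratio under $f$ is
$\sum_x f(x)\, g(x)/f(x) = \sum_x g(x) = 1$, the discrete counterpart of the fact used above.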


A closely related notion is that of martingale difference sequence, meaning an adapted
sequence $\{(D_k, \mathcal{F}_k)\}_{k=1}^\infty$ such that, for all $k \ge 1$,

$$
\mathbb{E}[|D_k|] < \infty \quad \text{and} \quad \mathbb{E}[D_{k+1} \mid \mathcal{F}_k] = 0. \qquad (2.26)
$$

As suggested by their name, such difference sequences arise in a natural way from martingales.
In particular, given a martingale $\{(Y_k, \mathcal{F}_k)\}_{k=0}^\infty$, let us define
$D_k = Y_k - Y_{k-1}$ for $k \ge 1$. We then have

$$
\begin{aligned}
\mathbb{E}[D_{k+1} \mid \mathcal{F}_k] &= \mathbb{E}[Y_{k+1} \mid \mathcal{F}_k] - \mathbb{E}[Y_k \mid \mathcal{F}_k] \\
&= \mathbb{E}[Y_{k+1} \mid \mathcal{F}_k] - Y_k = 0,
\end{aligned}
$$

using the martingale property (2.25) and the fact that $Y_k$ is measurable with respect to
$\mathcal{F}_k$. Thus, for any martingale sequence $\{Y_k\}_{k=0}^\infty$, we have the telescoping
decomposition

$$
Y_n - Y_0 = \sum_{k=1}^n D_k, \qquad (2.27)
$$

where $\{D_k\}_{k=1}^\infty$ is a martingale difference sequence. This decomposition plays an
important role in our development of concentration inequalities to follow.


2.2.2 Concentration bounds for martingale difference sequences 


We now turn to the derivation of concentration inequalities for martingales. These inequalities
can be viewed in one of two ways: either as bounds for the difference $Y_n - Y_0$, or as
bounds for the sum $\sum_{k=1}^n D_k$ of the associated martingale difference sequence. Throughout
this section, we present results mainly in terms of martingale differences, with the
understanding that such bounds have direct consequences for martingale sequences. Of particular
interest to us is the Doob martingale described in Example 2.17, which can be used to control
the deviations of a function from its expectation.


We begin by stating and proving a general Bernstein-type bound for a martingale differ- 
ence sequence, based on imposing a sub-exponential condition on the martingale differences. 


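Theorem 2.19 Let $\{(D_k, \mathcal{F}_k)\}_{k=1}^\infty$ be a martingale difference sequence, and
suppose that $\mathbb{E}[e^{\lambda D_k} \mid \mathcal{F}_{k-1}] \le e^{\lambda^2 \nu_k^2 / 2}$
almost surely for any $|\lambda| < 1/\alpha_k$. Then the following hold:

(a) The sum $\sum_{k=1}^n D_k$ is sub-exponential with parameters
$\big(\sqrt{\sum_{k=1}^n \nu_k^2},\, \alpha_*\big)$, where $\alpha_* := \max_{k=1,\ldots,n} \alpha_k$.

(b) The sum satisfies the concentration inequality

$$
\mathbb{P}\Big[\Big|\sum_{k=1}^n D_k\Big| \ge t\Big] \le
\begin{cases}
2 e^{-\frac{t^2}{2 \sum_{k=1}^n \nu_k^2}} & \text{if } 0 \le t \le \frac{\sum_{k=1}^n \nu_k^2}{\alpha_*}, \\
2 e^{-\frac{t}{2\alpha_*}} & \text{if } t > \frac{\sum_{k=1}^n \nu_k^2}{\alpha_*}.
\end{cases}
\qquad (2.28)
$$

Proof We follow the standard approach of controlling the moment generating function of
$\sum_{k=1}^n D_k$, and then applying the Chernoff bound. For any scalar $\lambda$ such that
$|\lambda| < \frac{1}{\alpha_*}$, conditioning on $\mathcal{F}_{n-1}$ and applying iterated
expectation yields

$$
\begin{aligned}
\mathbb{E}\big[e^{\lambda \sum_{k=1}^n D_k}\big]
&= \mathbb{E}\big[e^{\lambda \sum_{k=1}^{n-1} D_k}\, \mathbb{E}[e^{\lambda D_n} \mid \mathcal{F}_{n-1}]\big] \\
&\le \mathbb{E}\big[e^{\lambda \sum_{k=1}^{n-1} D_k}\big]\, e^{\lambda^2 \nu_n^2 / 2}, \qquad (2.29)
\end{aligned}
$$

where the inequality follows from the stated assumption on $D_n$. Iterating this procedure
yields the bound $\mathbb{E}[e^{\lambda \sum_{k=1}^n D_k}] \le e^{\lambda^2 \sum_{k=1}^n \nu_k^2 / 2}$,
valid for all $|\lambda| < \frac{1}{\alpha_*}$. By definition, we conclude that
$\sum_{k=1}^n D_k$ is sub-exponential with parameters
$\big(\sqrt{\sum_{k=1}^n \nu_k^2},\, \alpha_*\big)$, as claimed. The tail bound (2.28) follows by
applying Proposition 2.9.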


In order for Theorem 2.19 to be useful in practice, we need to isolate sufficient and
easily checkable conditions for the differences $D_k$ to be almost surely sub-exponential (or
sub-Gaussian when $\alpha = 0$). As discussed previously, bounded random variables are
sub-Gaussian, which leads to the following corollary:


Corollary 2.20 (Azuma–Hoeffding) Let $(\{(D_k, \mathcal{F}_k)\}_{k=1}^\infty)$ be a martingale
difference sequence for which there are constants $\{(a_k, b_k)\}_{k=1}^n$ such that
$D_k \in [a_k, b_k]$ almost surely for all $k = 1, \ldots, n$. Then, for all $t \ge 0$,

$$
\mathbb{P}\Big[\Big|\sum_{k=1}^n D_k\Big| \ge t\Big] \le 2 e^{-\frac{2t^2}{\sum_{k=1}^n (b_k - a_k)^2}}. \qquad (2.30)
$$

Proof Recall the decomposition (2.29) in the proof of Theorem 2.19; from the structure
of this argument, it suffices to show that
$\mathbb{E}[e^{\lambda D_k} \mid \mathcal{F}_{k-1}] \le e^{\lambda^2 (b_k - a_k)^2 / 8}$ almost
surely for each $k = 1, 2, \ldots, n$. But since $D_k \in [a_k, b_k]$ almost surely, the
conditioned variable $(D_k \mid \mathcal{F}_{k-1})$ also belongs to this interval almost surely,
and hence from the result of Exercise 2.4, it is sub-Gaussian with parameter at most
$\sigma = (b_k - a_k)/2$.
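As an illustration of (2.30), consider the simplest martingale difference sequence: i.i.d. signs
$D_k \in \{-1, +1\}$, for which $b_k - a_k = 2$ and the exact two-sided tail is a binomial sum.
The sketch below compares the two; the choice of example, the value $n = 100$ and the function
names are illustrative assumptions.

```python
import math

def azuma_hoeffding_bound(t, widths):
    """Right-hand side of (2.30), where widths[k] = b_k - a_k."""
    return 2.0 * math.exp(-2.0 * t**2 / sum(w**2 for w in widths))

def exact_walk_tail(t, n):
    """Exact P[|D_1 + ... + D_n| >= t] for i.i.d. uniform signs D_k in {-1, +1}."""
    # The sum equals 2 * heads - n, with heads ~ Binomial(n, 1/2).
    hits = sum(math.comb(n, h) for h in range(n + 1) if abs(2 * h - n) >= t)
    return hits / 2**n

n = 100  # hypothetical number of steps
for t in (10, 20, 30):
    # Each line shows: deviation t, exact tail probability, Azuma-Hoeffding bound.
    print(t, exact_walk_tail(t, n), azuma_hoeffding_bound(t, [2.0] * n))
```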


An important application of Corollary 2.20 concerns functions that satisfy a bounded
difference property. Let us first introduce some convenient notation. Given vectors
$x, x' \in \mathbb{R}^n$ and an index $k \in \{1, 2, \ldots, n\}$, we define a new vector
$x^{\setminus k} \in \mathbb{R}^n$ via

$$
x_j^{\setminus k} :=
\begin{cases}
x_j & \text{if } j \ne k, \\
x'_k & \text{if } j = k.
\end{cases}
\qquad (2.31)
$$

With this notation, we say that $f: \mathbb{R}^n \to \mathbb{R}$ satisfies the bounded difference
inequality with parameters $(L_1, \ldots, L_n)$ if, for each index $k = 1, 2, \ldots, n$,

$$
|f(x) - f(x^{\setminus k})| \le L_k \quad \text{for all } x, x' \in \mathbb{R}^n. \qquad (2.32)
$$

For instance, if the function $f$ is $L$-Lipschitz with respect to the Hamming norm
$d_H(x, y) = \sum_{i=1}^n \mathbb{I}[x_i \ne y_i]$, which counts the number of positions in which
$x$ and $y$ differ, then the bounded difference inequality holds with parameter $L$ uniformly
across all coordinates.


Corollary 2.21 (Bounded differences inequality) Suppose that $f$ satisfies the bounded
difference property (2.32) with parameters $(L_1, \ldots, L_n)$ and that the random vector
$X = (X_1, X_2, \ldots, X_n)$ has independent components. Then

$$
\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \ge t\big] \le 2 e^{-\frac{2t^2}{\sum_{k=1}^n L_k^2}} \quad \text{for all } t \ge 0. \qquad (2.33)
$$


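Proof Recalling the Doob martingale introduced in Example 2.17, consider the associated
martingale difference sequence

$$
D_k = \mathbb{E}[f(X) \mid X_1, \ldots, X_k] - \mathbb{E}[f(X) \mid X_1, \ldots, X_{k-1}]. \qquad (2.34)
$$

We claim that $D_k$ lies in an interval of length at most $L_k$ almost surely. In order to prove
this claim, define the random variables

$$
A_k := \inf_x \mathbb{E}[f(X) \mid X_1, \ldots, X_{k-1}, x] - \mathbb{E}[f(X) \mid X_1, \ldots, X_{k-1}]
$$

and

$$
B_k := \sup_x \mathbb{E}[f(X) \mid X_1, \ldots, X_{k-1}, x] - \mathbb{E}[f(X) \mid X_1, \ldots, X_{k-1}].
$$

On one hand, we have

$$
D_k - A_k = \mathbb{E}[f(X) \mid X_1, \ldots, X_k] - \inf_x \mathbb{E}[f(X) \mid X_1, \ldots, X_{k-1}, x],
$$

so that $D_k \ge A_k$ almost surely. A similar argument shows that $D_k \le B_k$ almost surely.

We now need to show that $B_k - A_k \le L_k$ almost surely. Observe that by the independence
of $\{X_k\}_{k=1}^n$, we have

$$
\mathbb{E}[f(X) \mid x_1, \ldots, x_k] = \mathbb{E}_{k+1}\big[f(x_1, \ldots, x_k, X_{k+1}^n)\big] \quad \text{for any vector } (x_1, \ldots, x_k),
$$

where $\mathbb{E}_{k+1}$ denotes expectation over $X_{k+1}^n := (X_{k+1}, \ldots, X_n)$.
Consequently, we have

$$
\begin{aligned}
B_k - A_k &= \sup_x \mathbb{E}_{k+1}\big[f(X_1, \ldots, X_{k-1}, x, X_{k+1}^n)\big] - \inf_x \mathbb{E}_{k+1}\big[f(X_1, \ldots, X_{k-1}, x, X_{k+1}^n)\big] \\
&\le \sup_{x, y} \Big|\mathbb{E}_{k+1}\big[f(X_1, \ldots, X_{k-1}, x, X_{k+1}^n) - f(X_1, \ldots, X_{k-1}, y, X_{k+1}^n)\big]\Big| \\
&\le L_k,
\end{aligned}
$$

using the bounded differences assumption. Thus, the variable $D_k$ lies within an interval of
length $L_k$ almost surely, so that the claim follows as a corollary of the Azuma–Hoeffding
inequality.

Remark: In the special case when $f$ is $L$-Lipschitz with respect to the Hamming norm,
Corollary 2.21 implies that

$$
\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \ge t\big] \le 2 e^{-\frac{2t^2}{n L^2}} \quad \text{for all } t \ge 0. \qquad (2.35)
$$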


Let us consider some examples to illustrate.

Example 2.22 (Classical Hoeffding from bounded differences) As a warm-up, let us show
how the classical Hoeffding bound (2.11) for bounded variables (say $X_i \in [a, b]$ almost
surely) follows as an immediate corollary of the bound (2.35). Consider the function
$f(x_1, \ldots, x_n) = \sum_{i=1}^n (x_i - \mu_i)$, where $\mu_i = \mathbb{E}[X_i]$ is the mean of
the $i$th random variable. For any index $k \in \{1, \ldots, n\}$, we have

$$
\begin{aligned}
|f(x) - f(x^{\setminus k})| &= |(x_k - \mu_k) - (x'_k - \mu_k)| \\
&= |x_k - x'_k| \le b - a,
\end{aligned}
$$

showing that $f$ satisfies the bounded difference inequality in each coordinate with parameter
$L = b - a$. Consequently, it follows from the bounded difference inequality (2.35) that

$$
\mathbb{P}\Big[\Big|\sum_{i=1}^n (X_i - \mu_i)\Big| \ge t\Big] \le 2 e^{-\frac{2t^2}{n (b - a)^2}},
$$

which is the classical Hoeffding bound for independent random variables. ♣

The class of U-statistics frequently arises in statistical problems; let us now study their
concentration properties.


Example 2.23 (U-statistics) Let $g: \mathbb{R}^2 \to \mathbb{R}$ be a symmetric function of its
arguments. Given an i.i.d. sequence $X_k$, $k \ge 1$, of random variables, the quantity

$$
U := \frac{1}{\binom{n}{2}} \sum_{j < k} g(X_j, X_k) \qquad (2.36)
$$

is known as a pairwise U-statistic. For instance, if $g(s, t) = |s - t|$, then $U$ is an unbiased
estimator of the mean absolute pairwise deviation $\mathbb{E}[|X_1 - X_2|]$. Note that, while $U$
is not a sum of independent random variables, the dependence is relatively weak, and this fact can
be revealed by a martingale analysis. If $g$ is bounded (say $\|g\|_\infty \le b$), then
Corollary 2.21 can be used to establish the concentration of $U$ around its mean. Viewing $U$ as a
function $f(x) = f(x_1, \ldots, x_n)$, for any given coordinate $k$, we have

$$
|f(x) - f(x^{\setminus k})| \le \frac{1}{\binom{n}{2}} \sum_{j \ne k} |g(x_j, x_k) - g(x_j, x'_k)|
\le \frac{(n - 1)(2b)}{\binom{n}{2}} = \frac{4b}{n},
$$

so that the bounded differences property holds with parameter $L_k = \frac{4b}{n}$ in each
coordinate. Thus, we conclude that

$$
\mathbb{P}\big[|U - \mathbb{E}[U]| \ge t\big] \le 2 e^{-\frac{n t^2}{8 b^2}}.
$$

This tail inequality implies that $U$ is a consistent estimate of $\mathbb{E}[U]$, and also yields
finite sample bounds on its quality as an estimator. Similar techniques can be used to obtain tail
bounds on U-statistics of higher order, involving sums over $k$-tuples of variables. ♣


Martingales and the bounded difference property also play an important role in analyzing
the properties of random graphs, and other random combinatorial structures.

Example 2.24 (Clique number in random graphs) An undirected graph is a pair $G = (V, E)$,
composed of a vertex set $V = \{1, \ldots, d\}$ and an edge set $E$, where each edge $e = (i, j)$
is an unordered pair of distinct vertices ($i \ne j$). A graph clique $C$ is a subset of vertices
such that $(i, j) \in E$ for all $i, j \in C$. The clique number $C(G)$ of the graph is the
cardinality of the largest clique; note that $C(G) \in [1, d]$. When the edges $E$ of the graph are
drawn according to some random process, then the clique number $C(G)$ is a random variable, and we
can study its concentration around its mean $\mathbb{E}[C(G)]$.

The Erdős–Rényi ensemble of random graphs is one of the most well-studied models: it is
defined by a parameter $p \in (0, 1)$ that specifies the probability with which each edge $(i, j)$
is included in the graph, independently across all $\binom{d}{2}$ edges. More formally, for each
$i < j$, let us introduce a Bernoulli edge-indicator variable $X_{ij}$ with parameter $p$, where
$X_{ij} = 1$ means that edge $(i, j)$ is included in the graph, and $X_{ij} = 0$ means that it is
not included.

Note that the $\binom{d}{2}$-dimensional random vector $Z := \{X_{ij}\}_{i < j}$ specifies the edge
set; thus, we may view the clique number $C(G)$ as a function $Z \mapsto f(Z)$. Let $Z'$ denote a
vector in which a single coordinate of $Z$ has been changed, and let $G'$ and $G$ be the
associated graphs. It is easy to see that $C(G')$ can differ from $C(G)$ by at most 1, so that
$|f(Z') - f(Z)| \le 1$. Thus, the function $C(G) = f(Z)$ satisfies the bounded difference property
in each coordinate with parameter $L = 1$, so that, with $n = \binom{d}{2}$ denoting the number of
coordinates,

$$
\mathbb{P}\Big[\frac{1}{n} \big|C(G) - \mathbb{E}[C(G)]\big| \ge \delta\Big] \le 2 e^{-2 n \delta^2}.
$$

Consequently, we see that the clique number of an Erdős–Rényi random graph is very
sharply concentrated around its expectation. ♣
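The claim that changing a single edge indicator moves the clique number by at most 1 can be
checked exhaustively on small graphs. The sketch below does so by brute force for $d = 4$
vertices; the function names and the choice of $d$ are illustrative, and the enumeration is only
feasible for very small $d$.

```python
from itertools import combinations

def clique_number(d, edges):
    """Size of the largest clique in a graph on vertices 0..d-1 (brute force)."""
    for size in range(d, 1, -1):
        for subset in combinations(range(d), size):
            if all(pair in edges for pair in combinations(subset, 2)):
                return size
    return 1 if d >= 1 else 0

def max_change_under_single_flip(d):
    """Largest |C(G') - C(G)| over all graphs G on d vertices and all single-edge flips."""
    pairs = list(combinations(range(d), 2))
    worst = 0
    for mask in range(2 ** len(pairs)):
        # Each bit of mask plays the role of one edge indicator X_ij.
        edges = {p for i, p in enumerate(pairs) if (mask >> i) & 1}
        base = clique_number(d, edges)
        for p in pairs:
            flipped = edges ^ {p}   # toggle a single coordinate of Z
            worst = max(worst, abs(clique_number(d, flipped) - base))
    return worst

print(max_change_under_single_flip(4))  # prints: 1
```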


Finally, let us study concentration of the Rademacher complexity, a notion that plays a 
central role in our subsequent development in Chapters 4 and 5. 


Example 2.25 (Rademacher complexity) Let {e,}{_, be an i.i.d. sequence of Rademacher 
variables (i.e., taking the values {—1, +1} equiprobably, as in Example 2.3). Given a collec- 
tion of vectors A c R”, define the random variable! 


Z:= sup] >) a = sup[(a, €)]. (2.37) 
acA k=l acA 

The random variable Z measures the size of A in a certain sense, and its expectation R(A) := 

E[Z(A)] is known as the Rademacher complexity of the set A. 

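Let us now show how Corollary 2.21 can be used to establish that $Z(A)$ is sub-Gaussian.
Viewing $Z(A)$ as a function $(\varepsilon_1, \ldots, \varepsilon_n) \mapsto f(\varepsilon_1, \ldots, \varepsilon_n)$, we need to bound the maximum
change when coordinate $k$ is changed. Given two Rademacher vectors $\varepsilon, \varepsilon' \in \{-1, +1\}^n$,
recall our definition (2.31) of the modified vector $\varepsilon^{\setminus k}$. Since $f(\varepsilon^{\setminus k}) \geq \langle a, \varepsilon^{\setminus k} \rangle$ for any $a \in A$,
we have

$$\langle a, \varepsilon \rangle - f(\varepsilon^{\setminus k}) \leq \langle a, \varepsilon - \varepsilon^{\setminus k} \rangle = a_k(\varepsilon_k - \varepsilon'_k) \leq 2|a_k|.$$

Taking the supremum over $A$ on both sides, we obtain the inequality

$$f(\varepsilon) - f(\varepsilon^{\setminus k}) \leq 2 \sup_{a \in A} |a_k|.$$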


¹ For the reader concerned about measurability, see the bibliographic discussion in Chapter 4.

Since the same argument applies with the roles of $\varepsilon$ and $\varepsilon^{\setminus k}$ reversed, we conclude that
$f$ satisfies the bounded difference inequality in coordinate $k$ with parameter $2 \sup_{a \in A} |a_k|$.
Consequently, Corollary 2.21 implies that the random variable $Z(A)$ is sub-Gaussian with
parameter at most $2\sqrt{\sum_{k=1}^n \sup_{a \in A} a_k^2}$. This sub-Gaussian parameter can be reduced to the
(potentially much) smaller quantity $\sqrt{\sup_{a \in A} \sum_{k=1}^n a_k^2}$ using alternative techniques; in particular,
see Example 3.5 in Chapter 3 for further details. ♣
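
To get a feel for how far apart the two sub-Gaussian parameters in Example 2.25 can be, the sketch below evaluates both for a finite set $A$ and simulates $Z(A)$. It is an illustration only: the function names are hypothetical, and the choice of $A$ as the standard basis of $\mathbb{R}^8$ is an assumption made here because it makes the gap easy to see (the coarse parameter grows like $2\sqrt{n}$, while the sharper one equals 1).

```python
import math
import random


def rademacher_sup(A, eps):
    """Z = sup_{a in A} <a, eps> for a finite collection A of vectors."""
    return max(sum(a_k * e_k for a_k, e_k in zip(a, eps)) for a in A)


def subgaussian_parameters(A):
    """Return (coarse, sharp) parameters from Example 2.25.

    coarse = 2 * sqrt(sum_k sup_a a_k^2)   (bounded differences)
    sharp  = sqrt(sup_a sum_k a_k^2)       (the smaller quantity)
    """
    n = len(A[0])
    coarse = 2.0 * math.sqrt(sum(max(a[k] ** 2 for a in A) for k in range(n)))
    sharp = math.sqrt(max(sum(a_k ** 2 for a_k in a) for a in A))
    return coarse, sharp


def simulate_z(A, trials=2000, seed=0):
    """Monte Carlo mean and standard deviation of Z(A)."""
    rng = random.Random(seed)
    n = len(A[0])
    draws = [rademacher_sup(A, [rng.choice((-1.0, 1.0)) for _ in range(n)])
             for _ in range(trials)]
    mean = sum(draws) / trials
    std = math.sqrt(sum((z - mean) ** 2 for z in draws) / trials)
    return mean, std


if __name__ == "__main__":
    n = 8
    # Assumed example set: the standard basis vectors of R^n.
    A = [[1.0 if j == k else 0.0 for j in range(n)] for k in range(n)]
    print(subgaussian_parameters(A))
    print(simulate_z(A))
```

Since the variance of a sub-Gaussian variable is at most the square of its parameter, the simulated standard deviation should sit below the sharper parameter.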


2.3 Lipschitz functions of Gaussian variables 


We conclude this chapter with a classical result on the concentration properties of Lipschitz 
functions of Gaussian variables. These functions exhibit a particularly attractive form of 
dimension-free concentration. Let us say that a function $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz with
respect to the Euclidean norm $\|\cdot\|_2$ if

$$|f(x) - f(y)| \leq L\|x - y\|_2 \quad \text{for all } x, y \in \mathbb{R}^n. \tag{2.38}$$


The following result guarantees that any such function is sub-Gaussian with parameter at 
most L: 


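Theorem 2.26 Let $(X_1, \ldots, X_n)$ be a vector of i.i.d. standard Gaussian variables, and
let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-Lipschitz with respect to the Euclidean norm. Then the variable
$f(X) - \mathbb{E}[f(X)]$ is sub-Gaussian with parameter at most $L$, and hence

$$\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \geq t\big] \leq 2e^{-\frac{t^2}{2L^2}} \quad \text{for all } t \geq 0. \tag{2.39}$$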


Note that this result is truly remarkable: it guarantees that any L-Lipschitz function of a 
standard Gaussian random vector, regardless of the dimension, exhibits concentration like a 
scalar Gaussian variable with variance $L^2$.
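
The dimension-free nature of this statement can be seen in a small simulation. The sketch below is illustrative only: it takes $f(x) = \|x\|_2$, which is 1-Lipschitz, and the function name `norm_fluctuation` is hypothetical. It estimates the mean and standard deviation of $f(X)$ for several dimensions.

```python
import math
import random


def norm_fluctuation(n, trials=400, seed=0):
    """Monte Carlo mean and standard deviation of ||X||_2 for X ~ N(0, I_n)."""
    rng = random.Random(seed)
    vals = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
            for _ in range(trials)]
    mean = sum(vals) / trials
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / trials)
    return mean, std


if __name__ == "__main__":
    for n in (10, 100, 1000):
        mean, std = norm_fluctuation(n)
        # The mean grows like sqrt(n); the spread stays of order one.
        print(n, round(mean, 3), round(std, 3))
```

The mean scales like $\sqrt{n}$, whereas the standard deviation should remain below $L = 1$ for every $n$, as sub-Gaussianity with parameter $L$ requires.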


Proof With the aim of keeping the proof as simple as possible, let us prove a version of the 
concentration bound (2.39) with a weaker constant in the exponent. (See the bibliographic 
notes for references to proofs of the sharpest results.) We also prove the result for a function 
that is both Lipschitz and differentiable; since any Lipschitz function is differentiable almost 
everywhere,² it is then straightforward to extend this result to the general setting. For a differentiable
function, the Lipschitz property guarantees that $\|\nabla f(x)\|_2 \leq L$ for all $x \in \mathbb{R}^n$. In
order to prove this version of the theorem, we begin by stating an auxiliary technical lemma: 


² This fact is a consequence of Rademacher’s theorem.

Lemma 2.27 Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable. Then for any convex function
$\phi : \mathbb{R} \to \mathbb{R}$, we have

$$\mathbb{E}\big[\phi\big(f(X) - \mathbb{E}[f(X)]\big)\big] \leq \mathbb{E}\Big[\phi\Big(\frac{\pi}{2}\langle \nabla f(X), Y \rangle\Big)\Big], \tag{2.40}$$

where $X, Y \sim N(0, I_n)$ are standard multivariate Gaussian, and independent.


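We now prove the theorem using this lemma. For any fixed $\lambda \in \mathbb{R}$, applying inequality (2.40)
to the convex function $t \mapsto e^{\lambda t}$ yields

$$\mathbb{E}_X\big[\exp\big(\lambda\{f(X) - \mathbb{E}[f(X)]\}\big)\big] \leq \mathbb{E}_{X,Y}\Big[\exp\Big(\frac{\lambda\pi}{2}\langle Y, \nabla f(X) \rangle\Big)\Big] = \mathbb{E}_X\Big[\exp\Big(\frac{\lambda^2\pi^2}{8}\|\nabla f(X)\|_2^2\Big)\Big],$$

where we have used the independence of $X$ and $Y$ to first take the expectation over $Y$
marginally, and the fact that $\langle Y, \nabla f(x) \rangle$ is a zero-mean Gaussian variable with variance
$\|\nabla f(x)\|_2^2$. Due to the Lipschitz condition on $f$, we have $\|\nabla f(x)\|_2 \leq L$ for all $x \in \mathbb{R}^n$, whence

$$\mathbb{E}\big[\exp\big(\lambda\{f(X) - \mathbb{E}[f(X)]\}\big)\big] \leq e^{\frac{1}{8}\lambda^2\pi^2 L^2},$$

which shows that $f(X) - \mathbb{E}[f(X)]$ is sub-Gaussian with parameter at most $\frac{\pi L}{2}$. The tail bound

$$\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \geq t\big] \leq 2\exp\Big(-\frac{2t^2}{\pi^2 L^2}\Big) \quad \text{for all } t \geq 0$$

follows from Proposition 2.5.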
It remains to prove Lemma 2.27, and we do so via a classical interpolation method that 
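exploits the rotation invariance of the Gaussian distribution. For each $\theta \in [0, \pi/2]$, consider
the random vector $Z(\theta) \in \mathbb{R}^n$ with components

$$Z_k(\theta) := X_k \sin\theta + Y_k \cos\theta \quad \text{for } k = 1, 2, \ldots, n.$$

By the convexity of $\phi$, we have

$$\mathbb{E}_X\big[\phi\big(f(X) - \mathbb{E}_Y[f(Y)]\big)\big] \leq \mathbb{E}_{X,Y}\big[\phi\big(f(X) - f(Y)\big)\big]. \tag{2.41}$$

Now since $Z_k(0) = Y_k$ and $Z_k(\pi/2) = X_k$ for all $k = 1, \ldots, n$, we have

$$f(X) - f(Y) = \int_0^{\pi/2} \frac{d}{d\theta} f(Z(\theta))\, d\theta = \int_0^{\pi/2} \langle \nabla f(Z(\theta)), Z'(\theta) \rangle\, d\theta, \tag{2.42}$$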


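where $Z'(\theta) \in \mathbb{R}^n$ denotes the elementwise derivative, a vector with the components
$Z'_k(\theta) = X_k \cos\theta - Y_k \sin\theta$. Substituting the integral representation (2.42) into our earlier
bound (2.41) yields

$$\mathbb{E}_X\big[\phi\big(f(X) - \mathbb{E}_Y[f(Y)]\big)\big] \leq \mathbb{E}_{X,Y}\Big[\phi\Big(\int_0^{\pi/2} \langle \nabla f(Z(\theta)), Z'(\theta) \rangle\, d\theta\Big)\Big]$$
$$= \mathbb{E}_{X,Y}\Big[\phi\Big(\frac{1}{\pi/2}\int_0^{\pi/2} \frac{\pi}{2}\langle \nabla f(Z(\theta)), Z'(\theta) \rangle\, d\theta\Big)\Big]$$
$$\leq \frac{1}{\pi/2}\int_0^{\pi/2} \mathbb{E}_{X,Y}\Big[\phi\Big(\frac{\pi}{2}\langle \nabla f(Z(\theta)), Z'(\theta) \rangle\Big)\Big]\, d\theta, \tag{2.43}$$

where the final step again uses convexity of $\phi$. By the rotation invariance of the Gaussian
distribution, for each $\theta \in [0, \pi/2]$, the pair $(Z_k(\theta), Z'_k(\theta))$ is a jointly Gaussian vector, with
zero mean and identity covariance $I_2$. Therefore, the expectation inside the integral in
equation (2.43) does not depend on $\theta$, and hence

$$\frac{1}{\pi/2}\int_0^{\pi/2} \mathbb{E}_{X,Y}\Big[\phi\Big(\frac{\pi}{2}\langle \nabla f(Z(\theta)), Z'(\theta) \rangle\Big)\Big]\, d\theta = \mathbb{E}\Big[\phi\Big(\frac{\pi}{2}\langle \nabla f(\widetilde{X}), \widetilde{Y} \rangle\Big)\Big],$$

where $(\widetilde{X}, \widetilde{Y})$ are independent standard Gaussian $n$-vectors. This completes the proof of the
bound (2.40).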


Note that the proof makes essential use of various properties specific to the standard 
Gaussian distribution. However, similar concentration results hold for other non-Gaussian 
distributions, including the uniform distribution on the sphere and any strictly log-concave 
distribution (see Chapter 3 for further discussion of such distributions). However, without 
additional structure of the function f (such as convexity), dimension-free concentration for 
Lipschitz functions need not hold for an arbitrary sub-Gaussian distribution; see the biblio- 
graphic section for further discussion of this fact. 


Theorem 2.26 is useful for a broad range of problems; let us consider some examples to 
illustrate. 


Example 2.28 ($\chi^2$ concentration) For a given sequence $\{Z_k\}_{k=1}^n$ of i.i.d. standard normal
variates, the random variable $Y := \sum_{k=1}^n Z_k^2$ follows a $\chi^2$-distribution with $n$ degrees of freedom.
The most direct way to obtain tail bounds on $Y$ is by noting that $Z_k^2$ is sub-exponential,
and exploiting independence (see Example 2.11). In this example, we pursue an alternative
approach, namely via concentration for Lipschitz functions of Gaussian variates. Indeed,
defining the variable $V = \sqrt{Y}/\sqrt{n}$, we can write $V = \|(Z_1, \ldots, Z_n)\|_2/\sqrt{n}$, and since the
Euclidean norm is a 1-Lipschitz function, Theorem 2.26 implies that

$$\mathbb{P}[V \geq \mathbb{E}[V] + \delta] \leq e^{-n\delta^2/2} \quad \text{for all } \delta \geq 0.$$

Using concavity of the square-root function and Jensen’s inequality, we have

$$\mathbb{E}[V] \leq \sqrt{\mathbb{E}[V^2]} = \Big\{\frac{1}{n}\sum_{k=1}^n \mathbb{E}[Z_k^2]\Big\}^{1/2} = 1.$$

Recalling that $V = \sqrt{Y}/\sqrt{n}$ and putting together the pieces yields

$$\mathbb{P}[Y/n \geq (1 + \delta)^2] \leq e^{-n\delta^2/2} \quad \text{for all } \delta \geq 0.$$

Since $(1 + \delta)^2 = 1 + 2\delta + \delta^2 \leq 1 + 3\delta$ for all $\delta \in [0, 1]$, we conclude that

$$\mathbb{P}[Y \geq n(1 + t)] \leq e^{-nt^2/18} \quad \text{for all } t \in [0, 3], \tag{2.44}$$

where we have made the substitution $t = 3\delta$. It is worthwhile comparing this tail bound to
those that can be obtained by using the fact that each $Z_k^2$ is sub-exponential, as discussed in
Example 2.11. ♣
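
As a sanity check on the bound (2.44), one can compare it with a simulated tail probability. The following is a minimal sketch under stated assumptions: the function name `chi2_tail_check` and the particular values of $n$ and $t$ are illustrative choices, not taken from the text.

```python
import math
import random


def chi2_tail_check(n, t, trials=2000, seed=0):
    """Return (empirical P[Y >= n(1+t)], bound exp(-n t^2 / 18)).

    Y is a chi-squared variate with n degrees of freedom, simulated as a
    sum of n squared standard normals; the bound is stated for t in [0, 3].
    """
    if not 0.0 <= t <= 3.0:
        raise ValueError("bound (2.44) is stated for t in [0, 3]")
    rng = random.Random(seed)
    threshold = n * (1.0 + t)
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) >= threshold
    )
    return hits / trials, math.exp(-n * t * t / 18.0)


if __name__ == "__main__":
    for t in (0.25, 0.5, 1.0):
        print(t, chi2_tail_check(50, t))
```

In such experiments the empirical frequency typically falls well below the bound, which reflects the loss in constants incurred by the Lipschitz route.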


Example 2.29 (Order statistics) Given a random vector $(X_1, X_2, \ldots, X_n)$, its order statistics
are obtained by reordering its entries in a non-decreasing manner, namely as

$$X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n-1)} \leq X_{(n)}. \tag{2.45}$$

As particular cases, we have $X_{(n)} = \max_{k=1,\ldots,n} X_k$ and $X_{(1)} = \min_{k=1,\ldots,n} X_k$. Given another
random vector $(Y_1, \ldots, Y_n)$, it can be shown that $|X_{(k)} - Y_{(k)}| \leq \|X - Y\|_2$ for all $k = 1, \ldots, n$,
so that each order statistic is a 1-Lipschitz function. (We leave the verification of this inequality
as an exercise for the reader.) Consequently, when $X$ is a standard Gaussian random vector,
Theorem 2.26 implies that

$$\mathbb{P}\big[|X_{(k)} - \mathbb{E}[X_{(k)}]| \geq \delta\big] \leq 2e^{-\frac{\delta^2}{2}} \quad \text{for all } \delta \geq 0.$$ ♣
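
The Lipschitz property of order statistics is easy to probe numerically before attempting the proof. The sketch below is illustrative, with hypothetical function names; it compares the largest gap between matching order statistics with the Euclidean distance between the two vectors.

```python
import math
import random


def max_order_statistic_gap(x, y):
    """max_k |x_(k) - y_(k)|, where x_(k) is the k-th smallest entry of x."""
    return max(abs(a - b) for a, b in zip(sorted(x), sorted(y)))


def euclidean_distance(x, y):
    """Euclidean norm of x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        x = [rng.gauss(0.0, 1.0) for _ in range(10)]
        y = [rng.gauss(0.0, 1.0) for _ in range(10)]
        # The first number should never exceed the second.
        print(max_order_statistic_gap(x, y), euclidean_distance(x, y))
```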


Example 2.30 (Gaussian complexity) This example is closely related to our earlier discussion
of Rademacher complexity in Example 2.25. Let $\{W_k\}_{k=1}^n$ be an i.i.d. sequence of
$N(0, 1)$ variables. Given a collection of vectors $A \subset \mathbb{R}^n$, define the random variable³

$$Z := \sup_{a \in A} \Big[\sum_{k=1}^n a_k W_k\Big] = \sup_{a \in A} \langle a, W \rangle. \tag{2.46}$$

As with the Rademacher complexity, the variable $Z = Z(A)$ is one way of measuring the
size of the set $A$, and will play an important role in later chapters. Viewing $Z$ as a function
$(w_1, \ldots, w_n) \mapsto f(w_1, \ldots, w_n)$, let us verify that $f$ is Lipschitz (with respect to Euclidean
norm) with parameter $\sup_{a \in A} \|a\|_2$. Let $w, w' \in \mathbb{R}^n$ be arbitrary, and let $a^* \in A$ be any vector
that achieves the maximum defining $f(w)$. Following the same argument as Example 2.25,
we have the upper bound

$$f(w) - f(w') \leq \langle a^*, w - w' \rangle \leq D(A)\,\|w - w'\|_2,$$

where $D(A) = \sup_{a \in A} \|a\|_2$ is the Euclidean width of the set. The same argument holds with
the roles of $w$ and $w'$ reversed, and hence

$$|f(w) - f(w')| \leq D(A)\,\|w - w'\|_2.$$

Consequently, Theorem 2.26 implies that

$$\mathbb{P}\big[|Z - \mathbb{E}[Z]| \geq \delta\big] \leq 2\exp\Big(-\frac{\delta^2}{2D^2(A)}\Big). \tag{2.47}$$ ♣


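³ For measurability concerns, see the bibliographic discussion in Chapter 4.

Example 2.31 (Gaussian chaos variables) As a generalization of the previous example, let
$Q \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and let $w, \widetilde{w}$ be independent zero-mean Gaussian random
vectors with covariance matrix $I_n$. The random variable

$$Z := \sum_{i,j=1}^n Q_{ij} w_i \widetilde{w}_j = w^{\mathrm{T}} Q \widetilde{w}$$

is known as a (decoupled) Gaussian chaos. By the independence of $w$ and $\widetilde{w}$, we have $\mathbb{E}[Z] = 0$,
so it is natural to seek a tail bound on $Z$.

Conditioned on $\widetilde{w}$, the variable $Z$ is a zero-mean Gaussian variable with variance
$\|Q\widetilde{w}\|_2^2 = \widetilde{w}^{\mathrm{T}} Q^2 \widetilde{w}$, whence

$$\mathbb{P}\big[|Z| \geq \delta \mid \widetilde{w}\big] \leq 2e^{-\frac{\delta^2}{2\|Q\widetilde{w}\|_2^2}}. \tag{2.48}$$

Let us now control the random variable $Y := \|Q\widetilde{w}\|_2$. Viewed as a function of the Gaussian
vector $\widetilde{w}$, it is Lipschitz with constant

$$|||Q|||_2 := \sup_{\|u\|_2 = 1} \|Qu\|_2, \tag{2.49}$$

corresponding to the $\ell_2$-operator norm of the matrix $Q$. Moreover, by Jensen’s inequality,
we have $\mathbb{E}[Y] \leq \sqrt{\mathbb{E}[\widetilde{w}^{\mathrm{T}} Q^2 \widetilde{w}]} = |||Q|||_F$, where

$$|||Q|||_F := \sqrt{\sum_{i=1}^n \sum_{j=1}^n Q_{ij}^2} \tag{2.50}$$

is the Frobenius norm of the matrix $Q$. Putting together the pieces yields the tail bound

$$\mathbb{P}\big[\|Q\widetilde{w}\|_2 \geq |||Q|||_F + t\big] \leq 2\exp\Big(-\frac{t^2}{2|||Q|||_2^2}\Big).$$

Note that $(|||Q|||_F + t)^2 \leq 2|||Q|||_F^2 + 2t^2$. Consequently, setting $t^2 = \delta\,|||Q|||_2$ and simplifying
yields

$$\mathbb{P}\big[\widetilde{w}^{\mathrm{T}} Q^2 \widetilde{w} \geq 2|||Q|||_F^2 + 2\delta\,|||Q|||_2\big] \leq 2\exp\Big(-\frac{\delta}{2|||Q|||_2}\Big).$$

Putting together the pieces, we find that

$$\mathbb{P}[|Z| \geq \delta] \leq 2\exp\Big(-\frac{\delta^2}{4|||Q|||_F^2 + 4\delta\,|||Q|||_2}\Big) + 2\exp\Big(-\frac{\delta}{2|||Q|||_2}\Big) \leq 4\exp\Big(-\frac{\delta^2}{4|||Q|||_F^2 + 4\delta\,|||Q|||_2}\Big).$$

We have thus shown that the Gaussian chaos variable satisfies a sub-exponential tail
bound. ♣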
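
The final bound of Example 2.31 can be compared against simulation in a special case. The sketch below assumes, purely for convenience, that $Q$ is diagonal, so that the operator norm is the largest absolute diagonal entry and the Frobenius norm is the Euclidean norm of the diagonal; the function names are hypothetical.

```python
import math
import random


def chaos_tail_bound(q_diag, delta):
    """4 exp(-delta^2 / (4 |||Q|||_F^2 + 4 delta |||Q|||_2)) for diagonal Q."""
    frob_sq = sum(q * q for q in q_diag)   # squared Frobenius norm
    op = max(abs(q) for q in q_diag)       # operator norm of a diagonal matrix
    return 4.0 * math.exp(-delta ** 2 / (4.0 * frob_sq + 4.0 * delta * op))


def empirical_chaos_tail(q_diag, delta, trials=5000, seed=0):
    """Monte Carlo estimate of P[|Z| >= delta] for the decoupled chaos Z."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Z = sum_i Q_ii w_i w~_i with w and w~ independent standard normals.
        z = sum(q * rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for q in q_diag)
        if abs(z) >= delta:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    q = [1.0] * 20  # assumed example: Q equal to the 20 x 20 identity
    for delta in (10.0, 20.0, 30.0):
        print(delta, empirical_chaos_tail(q, delta), chaos_tail_bound(q, delta))
```

For small $\delta$ the bound exceeds 1 and is vacuous; it becomes informative once $\delta$ is large relative to $|||Q|||_F$.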


Example 2.32 (Singular values of Gaussian random matrices) For integers $n > d$, let
$X \in \mathbb{R}^{n \times d}$ be a random matrix with i.i.d. $N(0, 1)$ entries, and let

$$\sigma_1(X) \geq \sigma_2(X) \geq \cdots \geq \sigma_d(X) \geq 0$$

denote its ordered singular values. By Weyl’s theorem (see Exercise 8.3), given another
matrix $Y \in \mathbb{R}^{n \times d}$, we have

$$\max_{k=1,\ldots,d} |\sigma_k(X) - \sigma_k(Y)| \leq |||X - Y|||_2 \leq |||X - Y|||_F, \tag{2.51}$$

where $|||\cdot|||_F$ denotes the Frobenius norm. The inequality (2.51) shows that each singular
value $\sigma_k(X)$ is a 1-Lipschitz function of the random matrix, so that Theorem 2.26 implies
that, for each $k = 1, \ldots, d$, we have

$$\mathbb{P}\big[|\sigma_k(X) - \mathbb{E}[\sigma_k(X)]| \geq \delta\big] \leq 2e^{-\frac{\delta^2}{2}} \quad \text{for all } \delta \geq 0. \tag{2.52}$$

Consequently, even though our techniques are not yet powerful enough to characterize the
expected value of these random singular values, we are guaranteed that the expectations are
representative of the typical behavior. See Chapter 6 for a more detailed discussion of the
singular values of random matrices. ♣


2.4 Appendix A: Equivalent versions of sub-Gaussian variables 
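In this appendix, we prove Theorem 2.6. We establish the equivalence by proving the circle
of implications (I) ⇒ (II) ⇒ (III) ⇒ (I), followed by the equivalence (I) ⇔ (IV).

Implication (I) ⇒ (II): If $X$ is zero-mean and sub-Gaussian with parameter $\sigma$, then we claim
that, for $Z \sim N(0, 2\sigma^2)$,

$$\frac{\mathbb{P}[X \geq t]}{\mathbb{P}[Z \geq t]} \leq \sqrt{8}\,e \quad \text{for all } t \geq 0,$$

showing that $X$ is majorized by $Z$ with constant $c = \sqrt{8}\,e$. On one hand, by the sub-Gaussianity
of $X$, we have $\mathbb{P}[X \geq t] \leq \exp\big(-\frac{t^2}{2\sigma^2}\big)$ for all $t \geq 0$. On the other hand, by
the Mills ratio for Gaussian tails, if $Z \sim N(0, 2\sigma^2)$, then we have

$$\mathbb{P}[Z \geq t] \geq \Big(\frac{\sqrt{2}\sigma}{t} - \frac{(\sqrt{2}\sigma)^3}{t^3}\Big) e^{-\frac{t^2}{4\sigma^2}} \quad \text{for all } t > 0. \tag{2.53}$$

(See Exercise 2.2 for a derivation of this inequality.) We split the remainder of our analysis
into two cases.

Case 1: First, suppose that $t \in [0, 2\sigma]$. Since the function $\Phi(t) = \mathbb{P}[Z \geq t]$ is decreasing,
for all $t$ in this interval,

$$\mathbb{P}[Z \geq t] \geq \mathbb{P}[Z \geq 2\sigma] \geq \Big(\frac{1}{\sqrt{2}} - \frac{1}{2\sqrt{2}}\Big)\frac{1}{e} = \frac{1}{\sqrt{8}\,e}.$$

Since $\mathbb{P}[X \geq t] \leq 1$, we conclude that $\frac{\mathbb{P}[X \geq t]}{\mathbb{P}[Z \geq t]} \leq \sqrt{8}\,e$ for all $t \in [0, 2\sigma]$.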


Case 2: Otherwise, we may assume that $t > 2\sigma$. In this case, by combining the Mills
ratio (2.53) and the sub-Gaussian tail bound and making the substitution $s = t/\sigma$, we find
that

$$\sup_{t > 2\sigma} \frac{\mathbb{P}[X \geq t]}{\mathbb{P}[Z \geq t]} \leq \sup_{s > 2} \frac{e^{-\frac{s^2}{4}}}{\frac{\sqrt{2}}{s} - \frac{(\sqrt{2})^3}{s^3}} \leq \sup_{s > 2} s^3 e^{-\frac{s^2}{4}} \leq \sqrt{8}\,e,$$

where the last step follows from a numerical calculation.


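Implication (II) ⇒ (III): Suppose that $X$ is majorized by a zero-mean Gaussian with variance
$\tau^2$. Since $X^{2k}$ is a non-negative random variable, we have

$$\mathbb{E}[X^{2k}] = \int_0^\infty \mathbb{P}[X^{2k} > s]\, ds = \int_0^\infty \mathbb{P}[|X| > s^{1/(2k)}]\, ds.$$

Under the majorization assumption, there is some constant $c \geq 1$ such that

$$\int_0^\infty \mathbb{P}[|X| > s^{1/(2k)}]\, ds \leq c \int_0^\infty \mathbb{P}[|Z| > s^{1/(2k)}]\, ds = c\,\mathbb{E}[Z^{2k}],$$

where $Z \sim N(0, \tau^2)$. The polynomial moments of $Z$ are given by

$$\mathbb{E}[Z^{2k}] = \frac{(2k)!}{2^k k!}\,\tau^{2k}, \quad \text{for } k = 1, 2, \ldots, \tag{2.54}$$

whence

$$\mathbb{E}[X^{2k}] \leq c\,\mathbb{E}[Z^{2k}] = c\,\frac{(2k)!}{2^k k!}\,\tau^{2k} \leq \frac{(2k)!}{2^k k!}\,(c\tau)^{2k}, \quad \text{for all } k = 1, 2, \ldots.$$

Consequently, the moment bound (2.12c) holds with $\theta = c\tau$.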


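Implication (III) ⇒ (I): For each $\lambda \in \mathbb{R}$, we have

$$\mathbb{E}[e^{\lambda X}] \leq 1 + \sum_{k=2}^\infty \frac{|\lambda|^k\, \mathbb{E}[|X|^k]}{k!}, \tag{2.55}$$

where we have used the fact $\mathbb{E}[X] = 0$ to eliminate the term involving $k = 1$. If $X$ is symmetric
around zero, then all of its odd moments vanish, and by applying our assumption on
$\theta(X)$, we obtain

$$\mathbb{E}[e^{\lambda X}] \leq 1 + \sum_{k=1}^\infty \frac{\lambda^{2k}}{(2k)!}\,\frac{(2k)!\,\theta^{2k}}{2^k k!} = e^{\frac{\lambda^2\theta^2}{2}},$$

which shows that $X$ is sub-Gaussian with parameter $\theta$.

When $X$ is not symmetric, we can bound the odd moments in terms of the even ones as

$$\mathbb{E}[|\lambda X|^{2k+1}] \overset{(i)}{\leq} \big(\mathbb{E}[|\lambda X|^{2k}]\,\mathbb{E}[|\lambda X|^{2k+2}]\big)^{1/2} \overset{(ii)}{\leq} \frac{1}{2}\big(\lambda^{2k}\mathbb{E}[X^{2k}] + \lambda^{2k+2}\mathbb{E}[X^{2k+2}]\big), \tag{2.56}$$

where step (i) follows from the Cauchy–Schwarz inequality; and step (ii) follows from
the arithmetic-geometric mean inequality. Applying this bound to the power-series expansion (2.55),
we obtain

$$\mathbb{E}[e^{\lambda X}] \leq 1 + \Big(\frac{1}{2} + \frac{1}{2 \cdot 3!}\Big)\lambda^2\,\mathbb{E}[X^2] + \sum_{k=2}^\infty \Big(\frac{1}{(2k)!} + \frac{1}{2}\Big[\frac{1}{(2k-1)!} + \frac{1}{(2k+1)!}\Big]\Big)\lambda^{2k}\,\mathbb{E}[X^{2k}]$$
$$\leq \sum_{k=0}^\infty 2^k\,\frac{\lambda^{2k}\,\mathbb{E}[X^{2k}]}{(2k)!} \leq e^{\frac{(\sqrt{2}\lambda\theta)^2}{2}},$$

which establishes the claim.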


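Implication (I) ⇒ (IV): This result is obvious for $s = 0$. For $s \in (0, 1)$, we begin with the
sub-Gaussian inequality $\mathbb{E}[e^{\lambda X}] \leq e^{\frac{\lambda^2\sigma^2}{2}}$, and multiply both sides by $e^{-\frac{\lambda^2\sigma^2}{2s}}$, thereby obtaining

$$\mathbb{E}\big[e^{\lambda X - \frac{\lambda^2\sigma^2}{2s}}\big] \leq e^{\frac{\lambda^2\sigma^2(s-1)}{2s}}.$$

Since this inequality holds for all $\lambda \in \mathbb{R}$, we may integrate both sides over $\lambda \in \mathbb{R}$, using
Fubini’s theorem to justify exchanging the order of integration. On the right-hand side, we
have

$$\int_{-\infty}^\infty \exp\Big(\frac{\lambda^2\sigma^2(s-1)}{2s}\Big)\, d\lambda = \frac{1}{\sigma}\sqrt{\frac{2\pi s}{1-s}}.$$

Turning to the left-hand side, for each fixed $x \in \mathbb{R}$, we have

$$\int_{-\infty}^\infty \exp\Big(\lambda x - \frac{\lambda^2\sigma^2}{2s}\Big)\, d\lambda = \frac{\sqrt{2\pi s}}{\sigma}\, e^{\frac{s x^2}{2\sigma^2}}.$$

Taking expectations with respect to $X$, we conclude that

$$\mathbb{E}\big[e^{\frac{s X^2}{2\sigma^2}}\big] \leq \frac{\sigma}{\sqrt{2\pi s}} \cdot \frac{1}{\sigma}\sqrt{\frac{2\pi s}{1-s}} = \frac{1}{\sqrt{1-s}},$$

which establishes the claim.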


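Implication (IV) ⇒ (I): Applying the bound $e^u \leq u + e^{9u^2/16}$ with $u = \lambda X$ and then taking
expectations, we find that

$$\mathbb{E}[e^{\lambda X}] \leq \mathbb{E}[\lambda X] + \mathbb{E}\big[e^{\frac{9\lambda^2 X^2}{16}}\big] = \mathbb{E}\big[e^{\frac{s X^2}{2\sigma^2}}\big] \leq \frac{1}{\sqrt{1-s}},$$

valid whenever $s = \frac{9}{8}\lambda^2\sigma^2$ is strictly less than 1. Noting that $\frac{1}{\sqrt{1-s}} \leq e^s$ for all $s \in [0, \frac{1}{2}]$ and
that $s < \frac{1}{2}$ whenever $|\lambda| < \frac{2}{3\sigma}$, we conclude that

$$\mathbb{E}[e^{\lambda X}] \leq e^{\frac{9}{8}\lambda^2\sigma^2} \quad \text{for all } |\lambda| < \frac{2}{3\sigma}. \tag{2.57a}$$

It remains to establish a similar upper bound for $|\lambda| \geq \frac{2}{3\sigma}$. Note that, for any $\alpha > 0$,
the functions $f(u) = \frac{u^2}{2\alpha}$ and $f^*(v) = \frac{\alpha v^2}{2}$ are conjugate duals. Thus, the Fenchel–Young
inequality implies that $uv \leq \frac{u^2}{2\alpha} + \frac{\alpha v^2}{2}$, valid for all $u, v \in \mathbb{R}$ and $\alpha > 0$. We apply this
inequality with $u = \lambda$, $v = X$ and $\alpha = c/\sigma^2$ for a constant $c > 0$ to be chosen; doing so yields

$$\mathbb{E}[e^{\lambda X}] \leq \mathbb{E}\big[e^{\frac{\lambda^2\sigma^2}{2c} + \frac{c X^2}{2\sigma^2}}\big] = e^{\frac{\lambda^2\sigma^2}{2c}}\, \mathbb{E}\big[e^{\frac{c X^2}{2\sigma^2}}\big] \overset{(ii)}{\leq} e^{\frac{\lambda^2\sigma^2}{2c}}\, e^{c},$$

where step (ii) is valid for any $c \in (0, 1/2)$, using the same argument that led to the bound
(2.57a). In particular, setting $c = 1/4$ yields $\mathbb{E}[e^{\lambda X}] \leq e^{2\lambda^2\sigma^2} e^{1/4}$.
Finally, when $|\lambda| \geq \frac{2}{3\sigma}$, then we have $\frac{1}{4} \leq \frac{9}{16}\lambda^2\sigma^2$, and hence

$$\mathbb{E}[e^{\lambda X}] \leq e^{2\lambda^2\sigma^2 + \frac{9}{16}\lambda^2\sigma^2} \leq e^{3\lambda^2\sigma^2}. \tag{2.57b}$$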

This inequality, combined with the bound (2.57a), completes the proof. 


2.5 Appendix B: Equivalent versions of sub-exponential variables 


This appendix is devoted to the proof of Theorem 2.13. In particular, we prove the chain of
equivalences I ⇔ II ⇔ III, followed by the equivalence II ⇔ IV.

(II) ⇒ (I): The existence of the moment generating function for $|\lambda| \leq c_0$ implies that
$\mathbb{E}[e^{\lambda X}] = 1 + \frac{\lambda^2\,\mathbb{E}[X^2]}{2} + o(\lambda^2)$ as $\lambda \to 0$. Moreover, an ordinary Taylor-series expansion implies
that $e^{\frac{\sigma^2\lambda^2}{2}} = 1 + \frac{\sigma^2\lambda^2}{2} + o(\lambda^2)$ as $\lambda \to 0$. Therefore, as long as $\sigma^2 > \mathbb{E}[X^2]$, there exists some
$b > 0$ such that $\mathbb{E}[e^{\lambda X}] \leq e^{\frac{\sigma^2\lambda^2}{2}}$ for all $|\lambda| \leq \frac{1}{b}$.

(I) ⇒ (II): This implication is immediate.
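(III) ⇒ (II): For an exponent $a > 0$ and truncation level $T > 0$ to be chosen, we have

$$\mathbb{E}\big[e^{a|X|}\,\mathbb{I}[e^{a|X|} \leq e^{aT}]\big] \leq \int_0^{e^{aT}} \mathbb{P}[e^{a|X|} \geq t]\, dt \leq 1 + \int_1^{e^{aT}} \mathbb{P}\Big[|X| \geq \frac{\log t}{a}\Big]\, dt.$$

Applying the assumed tail bound, we obtain

$$\mathbb{E}\big[e^{a|X|}\,\mathbb{I}[e^{a|X|} \leq e^{aT}]\big] \leq 1 + c_1 \int_1^{e^{aT}} e^{-\frac{c_2 \log t}{a}}\, dt = 1 + c_1 \int_1^{e^{aT}} t^{-c_2/a}\, dt.$$

Thus, for any $a \in [0, \frac{c_2}{2}]$, the integrand is at most $t^{-2}$, and we have

$$\mathbb{E}\big[e^{a|X|}\,\mathbb{I}[e^{a|X|} \leq e^{aT}]\big] \leq 1 + c_1\big(1 - e^{-aT}\big) \leq 1 + c_1.$$

By taking the limit as $T \to \infty$, we conclude that $\mathbb{E}[e^{a|X|}]$ is finite for all $a \in [0, \frac{c_2}{2}]$. Since
both $e^{aX}$ and $e^{-aX}$ are upper bounded by $e^{|a||X|}$, it follows that $\mathbb{E}[e^{aX}]$ is finite for all $|a| \leq \frac{c_2}{2}$.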


(II) ⇒ (III): By the Chernoff bound with $\lambda = c_0/2$, we have

$$\mathbb{P}[X \geq t] \leq \mathbb{E}\big[e^{\frac{c_0 X}{2}}\big]\, e^{-\frac{c_0 t}{2}}.$$

Applying a similar argument to $-X$, we conclude that $\mathbb{P}[|X| \geq t] \leq c_1 e^{-c_2 t}$ with
$c_1 = \mathbb{E}[e^{c_0 X/2}] + \mathbb{E}[e^{-c_0 X/2}]$ and $c_2 = c_0/2$.

(II) ⇔ (IV): Since the moment generating function exists in an open interval around zero,
we can consider the power-series expansion

$$\mathbb{E}[e^{\lambda X}] = 1 + \sum_{k=2}^\infty \frac{\lambda^k\,\mathbb{E}[X^k]}{k!} \quad \text{for all } |\lambda| < a. \tag{2.58}$$

By definition, the quantity $\gamma(X)$ is the radius of convergence of this power series, from which
the equivalence between (II) and (IV) follows.


2.6 Bibliographic details and background 


Further background and details on tail bounds can be found in various books (e.g., Saulis and 
Statulevicius, 1991; Petrov, 1995; Buldygin and Kozachenko, 2000; Boucheron et al., 2013). 
Classic papers on tail bounds include those of Bernstein (1937), Chernoff (1952), Bahadur 
and Ranga Rao (1960), Bennett (1962), Hoeffding (1963) and Azuma (1967). The idea of 
using the cumulant function to bound the tails of a random variable was first introduced by 
Bernstein (1937), and further developed by Chernoff (1952), whose name is now frequently 
associated with the method. The book by Saulis and Statulevicius (1991) provides a number 
of more refined results that can be established using cumulant-based techniques. The original 
work of Hoeffding (1963) gives results both for sums of independent random variables, 
assumed to be bounded from above, as well as certain types of dependent random variables, 
including U-statistics. The work of Azuma (1967) applies to general martingales that are 
sub-Gaussian in a conditional sense, as in Theorem 2.19. 

The book by Buldygin and Kozachenko (2000) provides a range of results on sub-Gaussian 
and sub-exponential variates. In particular, Theorems 2.6 and 2.13 are based on results from 
this book. The Orlicz norms, discussed briefly in Exercises 2.18 and 2.19, provide an ele- 
gant generalization of the sub-exponential and sub-Gaussian families. See Section 5.6 and 
the books (Ledoux and Talagrand, 1991; Buldygin and Kozachenko, 2000) for further back- 
ground on Orlicz norms. 

The Johnson—Lindenstrauss lemma, discussed in Example 2.12, was originally proved 
by Johnson and Lindenstrauss (1984) as an intermediate step in a more general result about 
Lipschitz embeddings. The original proof of the lemma was based on random matrices with 
orthonormal rows, as opposed to the standard Gaussian random matrix used here. The use 
of random projection for dimension reduction and algorithmic speed-ups has a wide range 
of applications; see the sources (Vempala, 2004; Mahoney, 2011; Cormode, 2012; Kane and 
Nelson, 2014; Woodruff, 2014; Bourgain et al., 2015; Pilanci and Wainwright, 2015) for 
further details. 

Tail bounds for U-statistics, as sketched out in Example 2.23, were derived by Hoeff- 
ding (1963). The book by de la Peña and Giné (1999) provides more advanced results, 
including extensions to uniform laws for U-processes and decoupling results. The bounded 
differences inequality (Corollary 2.21) and extensions thereof have many applications in the 
study of randomized algorithms as well as random graphs and other combinatorial objects. 
A number of such applications can be found in the survey by McDiarmid (1989), and the 
book by Boucheron et al. (2013). 

Milman and Schechtman (1986) provide the short proof of Gaussian concentration for
Lipschitz functions, on which Theorem 2.26 is based. Ledoux (2001) provides an example
of a Lipschitz function of an i.i.d. sequence of Rademacher variables (i.e., taking values 
{—1, +1} equiprobably) for which sub-Gaussian concentration fails to hold (cf. p. 128 in his 
book). However, sub-Gaussian concentration does hold for Lipschitz functions of bounded 
random variables with an additional convexity condition; see Section 3.3.5 for further de- 
tails. 

The kernel density estimation problem from Exercise 2.15 is a particular form of non- 
parametric estimation; we return to such problems in Chapters 13 and 14. Although we have 
focused exclusively on tail bounds for real-valued random variables, there are many general- 
izations to random variables taking values in Hilbert and other function spaces, as considered 
in Exercise 2.16. The books (Ledoux and Talagrand, 1991; Yurinsky, 1995) contain further 
background on such results. We also return to consider some versions of these bounds in 
Chapter 14. The Hanson—Wright inequality discussed in Exercise 2.17 was proved in the 
papers (Hanson and Wright, 1971; Wright, 1973); see the papers (Hsu et al., 2012b; Rudel- 
son and Vershynin, 2013) for more modern treatments. The moment-based tail bound from 
Exercise 2.20 relies on a classical inequality due to Rosenthal (1970). Exercise 2.21 outlines 
the proof of the rate-distortion theorem for the Bernoulli source. It is a particular instance 
of more general information-theoretic results that are proved using probabilistic techniques; 
see the book by Cover and Thomas (1991) for further reading. The Ising model (2.74) dis- 
cussed in Exercise 2.22 has a lengthy history dating back to Ising (1925). The book by Tala- 
grand (2003) contains a wealth of information on spin glass models and their mathematical 
properties. 


2.7 Exercises 


Exercise 2.1 (Tightness of inequalities) The Markov and Chebyshev inequalities cannot 
be improved in general. 


(a) Provide a non-negative random variable X for which Markov’s inequality (2.1) is met 
with equality. 
(b) Provide a random variable Y for which Chebyshev’s inequality (2.2) is met with equality. 


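Exercise 2.2 (Mills ratio) Let $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ be the density function of a standard normal
$Z \sim N(0, 1)$ variate.

(a) Show that $\phi'(z) + z\phi(z) = 0$.
(b) Use part (a) to show that

$$\phi(z)\Big(\frac{1}{z} - \frac{1}{z^3}\Big) \leq \mathbb{P}[Z \geq z] \leq \phi(z)\Big(\frac{1}{z} - \frac{1}{z^3} + \frac{3}{z^5}\Big) \quad \text{for all } z > 0. \tag{2.59}$$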


Exercise 2.3 (Polynomial Markov versus Chernoff) Suppose that $X \geq 0$, and that the
moment generating function of $X$ exists in an interval around zero. Given some $\delta > 0$ and
integer $k = 1, 2, \ldots$, show that

$$\inf_{k = 0, 1, 2, \ldots} \frac{\mathbb{E}[|X|^k]}{\delta^k} \leq \inf_{\lambda > 0} \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda\delta}}. \tag{2.60}$$

Consequently, an optimized bound based on polynomial moments is always at least as good
as the Chernoff upper bound.


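Exercise 2.4 (Sharp sub-Gaussian parameter for bounded random variable) Consider a
random variable $X$ with mean $\mu = \mathbb{E}[X]$, and such that, for some scalars $b > a$, $X \in [a, b]$
almost surely.

(a) Defining the function $\psi(\lambda) = \log \mathbb{E}[e^{\lambda X}]$, show that $\psi(0) = 0$ and $\psi'(0) = \mu$.
(b) Show that $\psi''(\lambda) = \mathbb{E}_\lambda[X^2] - (\mathbb{E}_\lambda[X])^2$, where we define $\mathbb{E}_\lambda[f(X)] := \frac{\mathbb{E}[f(X)e^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]}$. Use this
fact to obtain an upper bound on $\sup_{\lambda \in \mathbb{R}} |\psi''(\lambda)|$.
(c) Use parts (a) and (b) to establish that $X$ is sub-Gaussian with parameter at most $\sigma = \frac{b-a}{2}$.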


Exercise 2.5 (Sub-Gaussian bounds and means/variances) Consider a random variable X 
such that 


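$$\mathbb{E}[e^{\lambda X}] \leq e^{\frac{\lambda^2\sigma^2}{2} + \lambda\mu} \quad \text{for all } \lambda \in \mathbb{R}. \tag{2.61}$$

(a) Show that $\mathbb{E}[X] = \mu$.

(b) Show that $\mathrm{var}(X) \leq \sigma^2$.

(c) Suppose that the smallest possible $\sigma$ satisfying the inequality (2.61) is chosen. Is it then
true that $\mathrm{var}(X) = \sigma^2$? Prove or disprove.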


Exercise 2.6 (Lower bounds on squared sub-Gaussians) Letting $\{X_i\}_{i=1}^n$ be an i.i.d. sequence
of zero-mean sub-Gaussian variables with parameter $\sigma$, consider the normalized
sum $Z_n := \frac{1}{n}\sum_{i=1}^n X_i^2$. Prove that

$$\mathbb{P}[Z_n \leq \mathbb{E}[Z_n] - \sigma^2\delta] \leq e^{-n\delta^2/16} \quad \text{for all } \delta \geq 0.$$


This result shows that the lower tail of a sum of squared sub-Gaussian variables behaves in 
a sub-Gaussian way. 


Exercise 2.7 (Bennett’s inequality) This exercise is devoted to a proof of a strengthening 
of Bernstein’s inequality, known as Bennett’s inequality. 


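(a) Consider a zero-mean random variable such that $|X_i| \leq b$ for some $b > 0$. Prove that

$$\log \mathbb{E}[e^{\lambda X_i}] \leq \sigma_i^2\lambda^2\Big\{\frac{e^{\lambda b} - 1 - \lambda b}{(\lambda b)^2}\Big\} \quad \text{for all } \lambda \in \mathbb{R},$$

where $\sigma_i^2 = \mathrm{var}(X_i)$.
(b) Given independent random variables $X_1, \ldots, X_n$ satisfying the condition of part (a), let
$\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ be the average variance. Prove Bennett’s inequality

$$\mathbb{P}\Big[\sum_{i=1}^n X_i \geq n\delta\Big] \leq \exp\Big\{-\frac{n\sigma^2}{b^2}\, h\Big(\frac{b\delta}{\sigma^2}\Big)\Big\}, \tag{2.62}$$

where $h(t) := (1 + t)\log(1 + t) - t$ for $t \geq 0$.
(c) Show that Bennett’s inequality is at least as good as Bernstein’s inequality.

Exercise 2.8 (Bernstein and expectations) Consider a non-negative random variable that
satisfies a concentration inequality of the form

$$\mathbb{P}[Z \geq t] \leq C e^{-\frac{t^2}{2(\nu^2 + bt)}} \tag{2.63}$$

for positive constants $(\nu, b)$ and $C \geq 1$.

(a) Show that $\mathbb{E}[Z] \leq 2\nu(\sqrt{\pi} + \sqrt{\log C}) + 4b(1 + \log C)$.
(b) Let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence of zero-mean variables satisfying the Bernstein
condition (2.15). Use part (a) to show that

$$\mathbb{E}\Big[\Big|\frac{1}{n}\sum_{i=1}^n X_i\Big|\Big] \leq \frac{2\sigma}{\sqrt{n}}\big(\sqrt{\pi} + \sqrt{\log 2}\big) + \frac{4b}{n}(1 + \log 2).$$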


Exercise 2.9 (Sharp upper bounds on binomial tails) Let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence of
Bernoulli variables with parameter $\alpha \in (0, 1/2]$, and consider the binomial random variable
$Z_n = \sum_{i=1}^n X_i$. The goal of this exercise is to prove, for any $\delta \in (0, \alpha)$, a sharp upper bound
on the tail probability $\mathbb{P}[Z_n \leq \delta n]$.

(a) Show that $\mathbb{P}[Z_n \leq \delta n] \leq e^{-nD(\delta \,\|\, \alpha)}$, where the quantity

$$D(\delta \,\|\, \alpha) := \delta \log\frac{\delta}{\alpha} + (1 - \delta)\log\frac{(1 - \delta)}{(1 - \alpha)} \tag{2.64}$$

is the Kullback–Leibler divergence between the Bernoulli distributions with parameters
$\delta$ and $\alpha$, respectively.
(b) Show that the bound from part (a) is strictly better than the Hoeffding bound for all
$\delta \in (0, \alpha)$.


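Exercise 2.10 (Lower bounds on binomial tail probabilities) Let $\{X_i\}_{i=1}^n$ be a sequence of
i.i.d. Bernoulli variables with parameter $\alpha \in (0, 1/2]$, and consider the binomial random variable
$Z_n = \sum_{i=1}^n X_i$. In this exercise, we establish a lower bound on the probability $\mathbb{P}[Z_n \leq \delta n]$
for each fixed $\delta \in (0, \alpha)$, thereby establishing that the upper bound from Exercise 2.9 is
tight up to a polynomial pre-factor. Throughout the analysis, we define $m = \lfloor n\delta \rfloor$, the largest
integer less than or equal to $n\delta$, and set $\widetilde{\delta} = \frac{m}{n}$.

(a) Prove that $\frac{1}{n}\log \mathbb{P}[Z_n \leq \delta n] \geq \frac{1}{n}\log\binom{n}{m} + \widetilde{\delta}\log\alpha + (1 - \widetilde{\delta})\log(1 - \alpha)$.
(b) Show that

$$\frac{1}{n}\log\binom{n}{m} \geq \phi(\widetilde{\delta}) - \frac{\log(n + 1)}{n}, \tag{2.65a}$$

where $\phi(\widetilde{\delta}) = -\widetilde{\delta}\log(\widetilde{\delta}) - (1 - \widetilde{\delta})\log(1 - \widetilde{\delta})$ is the binary entropy. (Hint: Let $Y$ be a
binomial random variable with parameters $(n, \widetilde{\delta})$ and show that $\mathbb{P}[Y = \ell]$ is maximized
when $\ell = m = \widetilde{\delta} n$.)

(c) Show that

$$\mathbb{P}[Z_n \leq \delta n] \geq \frac{1}{n + 1}\, e^{-nD(\delta \,\|\, \alpha)}, \tag{2.65b}$$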


where the Kullback–Leibler divergence $D(\delta \,\|\, \alpha)$ was previously defined (2.64).

Exercise 2.11 (Upper and lower bounds for Gaussian maxima) Let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence
of $N(0, \sigma^2)$ variables, and consider the random variable $Z_n := \max_{i=1,\ldots,n} |X_i|$.

(a) Prove that

$$\mathbb{E}[Z_n] \leq \sqrt{2\sigma^2\log n} + \frac{4\sigma}{\sqrt{2\log n}} \quad \text{for all } n \geq 2.$$

(Hint: You may use the tail bound $\mathbb{P}[U \geq \delta] \leq \sqrt{\frac{2}{\pi}}\,\frac{1}{\delta}\, e^{-\delta^2/2}$, valid for any standard normal
variate.)
(b) Prove that

$$\mathbb{E}[Z_n] \geq (1 - 1/e)\sqrt{2\sigma^2\log n} \quad \text{for all } n \geq 5.$$

(c) Prove that

$$\frac{\mathbb{E}[Z_n]}{\sqrt{2\sigma^2\log n}} \to 1 \quad \text{as } n \to +\infty.$$


Exercise 2.12 (Upper bounds for sub-Gaussian maxima) Let $\{X_i\}_{i=1}^n$ be a sequence of zero-mean
random variables, each sub-Gaussian with parameter $\sigma$. (No independence assumptions
are needed.)

(a) Prove that

$$\mathbb{E}\Big[\max_{i=1,\ldots,n} X_i\Big] \leq \sqrt{2\sigma^2\log n} \quad \text{for all } n \geq 1. \tag{2.66}$$

(Hint: The exponential is a convex function.)

(b) Prove that the random variable $Z = \max_{i=1,\ldots,n} |X_i|$ satisfies

$$\mathbb{E}[Z] \leq \sqrt{2\sigma^2\log(2n)} \leq 2\sqrt{\sigma^2\log n}, \tag{2.67}$$

valid for all $n \geq 2$.


Exercise 2.13 (Operations on sub-Gaussian variables) Suppose that $X_1$ and $X_2$ are zero-mean
and sub-Gaussian with parameters $\sigma_1$ and $\sigma_2$, respectively.

(a) If $X_1$ and $X_2$ are independent, show that the random variable $X_1 + X_2$ is sub-Gaussian
with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$.

(b) Show that, in general (without assuming independence), the random variable $X_1 + X_2$ is
sub-Gaussian with parameter at most $\sqrt{2}\sqrt{\sigma_1^2 + \sigma_2^2}$.

(c) In the same setting as part (b), show that $X_1 + X_2$ is sub-Gaussian with parameter at most
$\sigma_1 + \sigma_2$.

(d) If $X_1$ and $X_2$ are independent, show that $X_1 X_2$ is sub-exponential with parameters
$(\nu, b) = (\sqrt{2}\sigma_1\sigma_2, \sqrt{2}\sigma_1\sigma_2)$.


Exercise 2.14 (Concentration around medians and means) Given a scalar random variable
$X$, suppose that there are positive constants $c_1, c_2$ such that

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq t] \leq c_1 e^{-c_2 t^2} \quad \text{for all } t \geq 0. \tag{2.68}$$

(a) Prove that $\mathrm{var}(X) \leq \frac{c_1}{c_2}$.
(b) A median $m_X$ is any number such that $\mathbb{P}[X \geq m_X] \geq 1/2$ and $\mathbb{P}[X \leq m_X] \geq 1/2$. Show
by example that the median need not be unique.

(c) Show that whenever the mean concentration bound (2.68) holds, then for any median
$m_X$, we have

$$\mathbb{P}[|X - m_X| \geq t] \leq c_3 e^{-c_4 t^2} \quad \text{for all } t \geq 0, \tag{2.69}$$

where $c_3 := 4c_1$ and $c_4 := \frac{c_2}{8}$.
(d) Conversely, show that whenever the median concentration bound (2.69) holds, then
mean concentration (2.68) holds with $c_1 = 2c_3$ and $c_2 = \frac{c_4}{4}$.


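Exercise 2.15 (Concentration and kernel density estimation) Let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence
of random variables drawn from a density $f$ on the real line. A standard estimate of
$f$ is the kernel density estimate

$$\widehat{f}_n(x) := \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big),$$

where $K : \mathbb{R} \to [0, \infty)$ is a kernel function satisfying $\int_{-\infty}^\infty K(t)\, dt = 1$, and $h > 0$ is a bandwidth
parameter. Suppose that we assess the quality of $\widehat{f}_n$ using the $L^1$-norm
$\|\widehat{f}_n - f\|_1 := \int_{-\infty}^\infty |\widehat{f}_n(t) - f(t)|\, dt$. Prove that

$$\mathbb{P}\big[\|\widehat{f}_n - f\|_1 \geq \mathbb{E}[\|\widehat{f}_n - f\|_1] + \delta\big] \leq e^{-\frac{n\delta^2}{8}}.$$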


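Exercise 2.16 (Deviation inequalities in a Hilbert space) Let $\{X_i\}_{i=1}^n$ be a sequence of independent
random variables taking values in a Hilbert space $\mathbb{H}$, and suppose that $\|X_i\|_{\mathbb{H}} \leq b_i$
almost surely. Consider the real-valued random variable $S_n = \big\|\sum_{i=1}^n X_i\big\|_{\mathbb{H}}$.

(a) Show that, for all $\delta \geq 0$,

$$\mathbb{P}\big[|S_n - \mathbb{E}[S_n]| \geq n\delta\big] \leq 2e^{-\frac{n\delta^2}{8b^2}}, \quad \text{where } b^2 = \frac{1}{n}\sum_{i=1}^n b_i^2.$$

(b) Show that $\mathbb{P}\big[\frac{S_n}{n} \geq a + \delta\big] \leq e^{-\frac{n\delta^2}{8b^2}}$, where $a := \sqrt{\frac{1}{n^2}\sum_{i=1}^n \mathbb{E}[\|X_i\|_{\mathbb{H}}^2]}$.
(Note: See Chapter 12 for basic background on Hilbert spaces.)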


Exercise 2.17 (Hanson–Wright inequality) Given random variables $\{X_i\}_{i=1}^n$ and a positive
semidefinite matrix $Q \in \mathcal{S}_+^{n \times n}$, consider the random quadratic form

$$Z = \sum_{i=1}^n \sum_{j=1}^n Q_{ij} X_i X_j. \tag{2.70}$$

The Hanson–Wright inequality guarantees that whenever the random variables $\{X_i\}_{i=1}^n$ are
i.i.d. with mean zero, unit variance, and $\sigma$-sub-Gaussian, then there are universal constants
$(c_1, c_2)$ such that

$$\mathbb{P}[Z \geq \mathrm{trace}(Q) + \sigma t] \leq 2\exp\Big\{-\min\Big(\frac{c_1 t}{|||Q|||_2}, \frac{c_2 t^2}{|||Q|||_F^2}\Big)\Big\}, \tag{2.71}$$

where $|||Q|||_2$ and $|||Q|||_F$ denote the operator and Frobenius norms, respectively. Prove this
inequality in the special case $X_i \sim N(0, 1)$. (Hint: The rotation invariance of the Gaussian
distribution and sub-exponential nature of $\chi^2$-variates could be useful.)


Exercise 2.18 (Orlicz norms) Let $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ be a strictly increasing convex function
that satisfies $\psi(0) = 0$. The $\psi$-Orlicz norm of a random variable $X$ is defined as

$$\|X\|_\psi := \inf\{t > 0 \mid \mathbb{E}[\psi(t^{-1}|X|)] \leq 1\}, \tag{2.72}$$

where $\|X\|_\psi$ is infinite if there is no finite $t$ for which the expectation $\mathbb{E}[\psi(t^{-1}|X|)]$ exists. For
the functions $u \mapsto u^q$ for some $q \in [1, \infty]$, the Orlicz norm is simply the usual $\ell_q$-norm
$\|X\|_q = (\mathbb{E}[|X|^q])^{1/q}$. In this exercise, we consider the Orlicz norms $\|\cdot\|_{\psi_q}$ defined by the
convex functions $\psi_q(u) = \exp(u^q) - 1$, for $q \geq 1$.

(a) If $\|X\|_{\psi_q} < +\infty$, show that there exist positive constants $c_1, c_2$ such that

$$\mathbb{P}[|X| > t] \leq c_1\exp(-c_2 t^q) \quad \text{for all } t > 0. \tag{2.73}$$

(In particular, you should be able to show that this bound holds with $c_1 = 2$ and
$c_2 = \|X\|_{\psi_q}^{-q}$.)
(b) Suppose that a random variable $X$ satisfies the tail bound (2.73). Show that $\|X\|_{\psi_q}$ is
finite.


Exercise 2.19 (Maxima of Orlicz variables) Recall the definition of Orlicz norm from
Exercise 2.18. Let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence of zero-mean random variables with finite
Orlicz norm $\sigma = \|X_i\|_{\psi_q}$. Show that


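Exercise 2.20 (Tail bounds under moment conditions) Suppose that $\{X_i\}_{i=1}^n$ are zero-mean
and independent random variables such that, for some fixed integer $m \ge 1$, they satisfy the
moment bound $\|X_i\|_{2m} := (\mathbb{E}[X_i^{2m}])^{\frac{1}{2m}} \le C_m$. Show that

$$\mathbb{P}\Big[\Big|\frac{1}{n}\sum_{i=1}^n X_i\Big| \ge \delta\Big] \le B_m \Big(\frac{1}{\sqrt{n}\,\delta}\Big)^{2m} \quad \text{for all } \delta > 0,$$

where $B_m$ is a universal constant depending only on $C_m$ and $m$.
Hint: You may find the following form of Rosenthal's inequality to be useful. Under the
stated conditions, there is a universal constant $R_m$ such that

$$\mathbb{E}\Big[\Big(\sum_{i=1}^n X_i\Big)^{2m}\Big] \le R_m \Big\{\sum_{i=1}^n \mathbb{E}[X_i^{2m}] + \Big(\sum_{i=1}^n \mathbb{E}[X_i^2]\Big)^m\Big\}.$$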


Exercise 2.21 (Concentration and data compression) Let $X = (X_1, X_2, \ldots, X_n)$ be a vector
of i.i.d. Bernoulli variables with parameter 1/2. The goal of lossy data compression is
to represent $X$ using a collection of binary vectors, say $\{z^1, \ldots, z^N\}$, such that the rescaled
Hamming distortion

$$d(X) := \min_{j=1,\ldots,N} \rho_H(X, z^j) = \min_{j=1,\ldots,N} \Big\{\frac{1}{n}\sum_{i=1}^n \mathbb{I}[X_i \ne z_i^j]\Big\}$$

is as small as possible. Each binary vector $z^j$ is known as a codeword, and the full collection
is called a codebook. Of course, one can always achieve zero distortion using a codebook
with $N = 2^n$ codewords, so the goal is to use $N = 2^{Rn}$ codewords for some rate $R < 1$. In this
exercise, we use tail bounds to study the trade-off between the rate $R$ and the distortion $\delta$.


(a) Suppose that the rate $R$ is upper bounded as

$$R < D_2(\delta \,\|\, 1/2) = \delta \log_2 \frac{\delta}{1/2} + (1 - \delta) \log_2 \frac{1 - \delta}{1/2}.$$

Show that, for any codebook $\{z^1, \ldots, z^N\}$ with $N \le 2^{nR}$ codewords, the probability of
the event $\{d(X) \le \delta\}$ goes to zero as $n$ goes to infinity. (Hint: Let $V^j$ be a $\{0,1\}$-valued
indicator variable for the event $\rho_H(X, z^j) \le \delta$, and define $V = \sum_{j=1}^N V^j$. The tail bounds
from Exercise 2.9 could be useful in bounding the probability $\mathbb{P}[V \ge 1]$.)

(b) We now show that, if $\Delta R := R - D_2(\delta \,\|\, 1/2) > 0$, then there exists a codebook that
achieves distortion $\delta$. In order to do so, consider a random codebook $\{Z^1, \ldots, Z^N\}$,
formed by generating each codeword $Z^j$ independently, and with all i.i.d. Ber(1/2) entries.
Let $V^j$ be an indicator for the event $\rho_H(X, Z^j) \le \delta$, and define $V = \sum_{j=1}^N V^j$.

(i) Show that $\mathbb{P}[V \ge 1] \ge \frac{(\mathbb{E}[V])^2}{\mathbb{E}[V^2]}$.
(ii) Use part (i) to show that $\mathbb{P}[V \ge 1] \to 1$ as $n \to +\infty$. (Hint: The tail bounds from
Exercise 2.10 could be useful.)


Exercise 2.22 (Concentration for spin glasses) For some positive integer $d \ge 2$, consider a
collection $\{\theta_{jk}\}_{j \ne k}$ of weights, one for each distinct pair $j \ne k$ of indices in $\{1, 2, \ldots, d\}$. We
can then define a probability distribution over the Boolean hypercube $\{-1, +1\}^d$ via the mass
function

$$\mathbb{P}_\theta(x_1, \ldots, x_d) = \exp\Big\{\frac{1}{\sqrt{d}} \sum_{j \ne k} \theta_{jk} x_j x_k - F_d(\theta)\Big\}, \tag{2.74}$$

where the function $F_d: \mathbb{R}^{\binom{d}{2}} \to \mathbb{R}$, known as the free energy, is given by

$$F_d(\theta) := \log\Big(\sum_{x \in \{-1,+1\}^d} \exp\Big\{\frac{1}{\sqrt{d}} \sum_{j \ne k} \theta_{jk} x_j x_k\Big\}\Big) \tag{2.75}$$

and serves to normalize the distribution. The probability distribution (2.74) was originally used
to describe the behavior of magnets in statistical physics, in which context it is known as
the Ising model. Suppose that the weights are chosen as i.i.d. random variables, so that
equation (2.74) now describes a random family of probability distributions. This family is
known as the Sherrington-Kirkpatrick model in statistical physics.

(a) Show that $F_d$ is a convex function.

(b) For any two vectors $\theta, \theta' \in \mathbb{R}^{\binom{d}{2}}$, show that $|F_d(\theta) - F_d(\theta')| \le \sqrt{d}\,\|\theta - \theta'\|_2$.

(c) Suppose that the weights are chosen in an i.i.d. manner as $\theta_{jk} \sim N(0, \beta^2)$ for each $j \ne k$.
Use the previous parts and Jensen's inequality to show that

$$\mathbb{P}\Big[\frac{F_d(\theta)}{d} \ge \log 2 + \frac{\beta^2}{4} + t\Big] \le 2 e^{-\frac{d t^2}{2\beta^2}} \quad \text{for all } t > 0. \tag{2.76}$$

Remark: Interestingly, it is known that, for any $\beta \in [0, 1)$, this upper tail bound captures
the asymptotic behavior of $F_d(\theta)/d$ accurately, in that $\frac{F_d(\theta)}{d} \xrightarrow{\text{a.s.}} \log 2 + \beta^2/4$ as $d \to \infty$. By
contrast, for $\beta \ge 1$, the behavior of this spin glass model is much more subtle; we refer the
reader to the bibliographic section for additional reading.


3 


Concentration of measure 


Building upon the foundation of Chapter 2, this chapter is devoted to an exploration of more 
advanced material on the concentration of measure. In particular, our goal is to provide an 
overview of the different types of methods available to derive tail bounds and concentration 
inequalities. We begin in Section 3.1 with a discussion of the entropy method for concen- 
tration, and illustrate its use in deriving tail bounds for Lipschitz functions of independent 
random variables. In Section 3.2, we turn to some geometric aspects of concentration in- 
equalities, a viewpoint that is historically among the oldest. Section 3.3 is devoted to the use 
of transportation cost inequalities for deriving concentration inequalities, a method that is in 
some sense dual to the entropy method, and well suited to certain types of dependent random 
variables. We conclude in Section 3.4 by deriving some tail bounds for empirical processes, 
including versions of the functional Hoeffding and Bernstein inequalities. These inequalities 
play an especially important role in our later treatment of nonparametric problems. 


3.1 Concentration by entropic techniques 


We begin our exploration with the entropy method and related techniques for deriving con- 
centration inequalities. 


3.1.1 Entropy and its properties 


A convex function $\phi: \mathbb{R} \to \mathbb{R}$ can be used to define a functional on the space of
probability distributions via

$$\mathbb{H}_\phi(X) := \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]),$$

where $X \sim \mathbb{P}$. This quantity, which is well defined for any random variable such that both
$X$ and $\phi(X)$ have finite expectations, is known as the $\phi$-entropy¹ of the random variable $X$.
By Jensen's inequality and the convexity of $\phi$, the $\phi$-entropy is always non-negative. As the
name suggests, it serves as a measure of variability. For instance, in the most extreme case,
we have $\mathbb{H}_\phi(X) = 0$ for any random variable such that $X$ is equal to its expectation
$\mathbb{P}$-almost-everywhere.

¹ The notation $\mathbb{H}_\phi(X)$ has the potential to mislead, since it suggests that the entropy is a function of $X$, and
hence a random variable. To be clear, the entropy $\mathbb{H}_\phi$ is a functional that acts on the probability measure $\mathbb{P}$, as
opposed to the random variable $X$.

There are various types of entropies, depending on the choice of the underlying convex
function $\phi$. Some of these entropies are already familiar to us. For example, the convex
function $\phi(u) = u^2$ yields

$$\mathbb{H}_\phi(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \operatorname{var}(X),$$

corresponding to the usual variance of the random variable $X$. Another interesting choice is
the convex function $\phi(u) = -\log u$ defined on the positive real line. When applied to the
positive random variable $Z := e^{\lambda X}$, this choice of $\phi$ yields

$$\mathbb{H}_\phi(e^{\lambda X}) = -\lambda \mathbb{E}[X] + \log \mathbb{E}[e^{\lambda X}] = \log \mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}],$$

a type of entropy corresponding to the centered cumulant generating function. In Chapter 2,
we have seen how both the variance and the cumulant generating function are useful objects
for obtaining concentration inequalities, in particular in the form of Chebyshev's inequality
and the Chernoff bound, respectively.
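The two choices of $\phi$ above can be checked on a small discrete distribution. The following is a minimal sketch, not part of the original text: the helper name `phi_entropy` and the three-point distribution are hypothetical and chosen only for illustration.

```python
import math


def phi_entropy(values, probs, phi):
    """Return E[phi(X)] - phi(E[X]) for a discrete random variable X."""
    mean = sum(p * x for x, p in zip(values, probs))
    mean_phi = sum(p * phi(x) for x, p in zip(values, probs))
    return mean_phi - phi(mean)


# Hypothetical three-point distribution (illustration only).
values = [1.0, 2.0, 4.0]
probs = [0.5, 0.3, 0.2]

# phi(u) = u^2 recovers the variance.
variance = phi_entropy(values, probs, lambda u: u * u)

# phi(u) = -log u applied to Z = exp(lam * X) recovers the centered
# cumulant generating function log E[exp(lam * (X - E[X]))].
lam = 0.7
z_values = [math.exp(lam * x) for x in values]
centered_cgf = phi_entropy(z_values, probs, lambda u: -math.log(u))
```

Both quantities are non-negative, as Jensen's inequality requires.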

Throughout the remainder of this chapter, we focus on a slightly different choice of entropy
functional, namely the one based on the convex function $\phi: [0, \infty) \to \mathbb{R}$ defined as

$$\phi(u) := u \log u \quad \text{for } u > 0, \quad \text{and} \quad \phi(0) := 0. \tag{3.1}$$

For any non-negative random variable $Z \ge 0$, it defines the $\phi$-entropy given by

$$\mathbb{H}(Z) = \mathbb{E}[Z \log Z] - \mathbb{E}[Z] \log \mathbb{E}[Z], \tag{3.2}$$

assuming that all relevant expectations exist. In the remainder of this chapter, we omit the
subscript $\phi$, since the choice (3.1) is to be implicitly understood.

The reader familiar with information theory may observe that the entropy (3.2) is closely
related to the Shannon entropy, as well as the Kullback-Leibler divergence; see Exercise 3.1
for an exploration of this connection. As will be clarified in the sequel, the most attractive
property of the $\phi$-entropy (3.2) is its so-called tensorization when applied to functions of
independent random variables.

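For the random variable $Z := e^{\lambda X}$, the entropy has an explicit expression as a function of
the moment generating function $\varphi_X(\lambda) = \mathbb{E}[e^{\lambda X}]$ and its first derivative. In particular, a short
calculation yields

$$\mathbb{H}(e^{\lambda X}) = \lambda \varphi_X'(\lambda) - \varphi_X(\lambda) \log \varphi_X(\lambda). \tag{3.3}$$

Consequently, if we know the moment generating function of $X$, then it is straightforward to
compute the entropy $\mathbb{H}(e^{\lambda X})$. Let us consider a simple example to illustrate:

Example 3.1 (Entropy of a Gaussian random variable) For the scalar Gaussian variable
$X \sim N(0, \sigma^2)$, we have $\varphi_X(\lambda) = e^{\lambda^2\sigma^2/2}$. By taking derivatives, we find that $\varphi_X'(\lambda) = \lambda\sigma^2\varphi_X(\lambda)$,
and hence

$$\mathbb{H}(e^{\lambda X}) = \lambda^2\sigma^2\varphi_X(\lambda) - \tfrac{1}{2}\lambda^2\sigma^2\varphi_X(\lambda) = \tfrac{1}{2}\lambda^2\sigma^2\varphi_X(\lambda). \tag{3.4}$$
♣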
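The closed form (3.4) can be compared against a direct numerical evaluation of the definition (3.2). The sketch below is an illustration under stated assumptions: it uses NumPy's Gauss-Hermite quadrature rule `hermegauss` for the weight $e^{-x^2/2}$, and the function names and the test values of $\lambda$ and $\sigma$ are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gaussian_exp_entropy(lam, sigma, degree=60):
    """Approximate H(e^{lam X}) for X ~ N(0, sigma^2) by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(degree)          # rule for the weight exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)     # normalize so the weights sum to one
    x = sigma * nodes
    z = np.exp(lam * x)
    mean_z = np.sum(weights * z)
    mean_z_log_z = np.sum(weights * lam * x * z)  # Z log Z = lam * X * e^{lam X}
    return mean_z_log_z - mean_z * np.log(mean_z)


def closed_form(lam, sigma):
    """Right-hand side of (3.4): (1/2) lam^2 sigma^2 exp(lam^2 sigma^2 / 2)."""
    s = lam**2 * sigma**2
    return 0.5 * s * np.exp(0.5 * s)


gap = abs(gaussian_exp_entropy(0.7, 1.3) - closed_form(0.7, 1.3))
```

For moderate values of $\lambda\sigma$ the two evaluations should agree to many digits; for large $\lambda\sigma$ a higher quadrature degree may be needed.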


Given that the moment generating function can be used to obtain concentration inequalities
via the Chernoff method, this connection suggests that there should also be a connection
between the entropy (3.3) and tail bounds. It is the goal of the following sections to make
this connection precise for various classes of random variables. We then show how the
entropy based on $\phi(u) = u \log u$ has a certain tensorization property that makes it particularly
well suited to dealing with general Lipschitz functions of collections of random variables.


3.1.2 Herbst argument and its extensions 


Intuitively, the entropy is a measure of the fluctuations in a random variable, so that control
on the entropy should translate into bounds on its tails. The Herbst argument makes this
intuition precise for a certain class of random variables. In particular, suppose that there is a
constant $\sigma > 0$ such that the entropy of $e^{\lambda X}$ satisfies an upper bound of the form

$$\mathbb{H}(e^{\lambda X}) \le \tfrac{1}{2}\sigma^2\lambda^2 \varphi_X(\lambda). \tag{3.5}$$

Note that by our earlier calculation in Example 3.1, any Gaussian variable $X \sim N(0, \sigma^2)$
satisfies this condition with equality for all $\lambda \in \mathbb{R}$. Moreover, as shown in Exercise 3.7, any
bounded random variable satisfies an inequality of the form (3.5).

Of interest here is the other implication: What does the entropy bound (3.5) imply about
the tail behavior of the random variable? The classical Herbst argument answers this question,
in particular showing that any such variable must have sub-Gaussian tail behavior.


Proposition 3.2 (Herbst argument) Suppose that the entropy $\mathbb{H}(e^{\lambda X})$ satisfies inequality (3.5)
for all $\lambda \in I$, where $I$ can be either of the intervals $[0, \infty)$ or $\mathbb{R}$. Then $X$ satisfies
the bound

$$\log \mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}] \le \tfrac{1}{2}\lambda^2\sigma^2 \quad \text{for all } \lambda \in I. \tag{3.6}$$

Remarks: When $I = \mathbb{R}$, then the inequality (3.6) is equivalent to asserting that the centered
variable $X - \mathbb{E}[X]$ is sub-Gaussian with parameter $\sigma$. Via an application of the usual
Chernoff argument, the bound (3.6) with $I = [0, \infty)$ implies the one-sided tail bound

$$\mathbb{P}[X \ge \mathbb{E}[X] + t] \le e^{-\frac{t^2}{2\sigma^2}}, \tag{3.7}$$

and with $I = \mathbb{R}$, it implies the two-sided bound $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le 2e^{-\frac{t^2}{2\sigma^2}}$. Of course, these
are the familiar tail bounds for sub-Gaussian variables discussed previously in Chapter 2.


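Proof Recall the representation (3.3) of entropy in terms of the moment generating function.
Combined with the assumed upper bound (3.5), we conclude that the moment generating
function $\varphi \equiv \varphi_X$ satisfies the differential inequality

$$\lambda\varphi'(\lambda) - \varphi(\lambda) \log \varphi(\lambda) \le \tfrac{1}{2}\sigma^2\lambda^2\varphi(\lambda), \quad \text{valid for all } \lambda \ge 0. \tag{3.8}$$

Define the function $G(\lambda) = \frac{1}{\lambda} \log \varphi(\lambda)$ for $\lambda \ne 0$, and extend the definition by continuity to

$$G(0) := \lim_{\lambda \to 0} G(\lambda) = \mathbb{E}[X]. \tag{3.9}$$

Note that we have $G'(\lambda) = \frac{1}{\lambda}\frac{\varphi'(\lambda)}{\varphi(\lambda)} - \frac{1}{\lambda^2} \log \varphi(\lambda)$, so that dividing both sides of the
inequality (3.8) by $\lambda^2\varphi(\lambda) > 0$ allows it to be rewritten
in the simple form $G'(\lambda) \le \frac{1}{2}\sigma^2$ for all $\lambda \in I$. For any $\lambda_0 > 0$, we can integrate both sides of
the inequality to obtain

$$G(\lambda) - G(\lambda_0) \le \tfrac{1}{2}\sigma^2(\lambda - \lambda_0).$$

Letting $\lambda_0 \to 0^+$ and using the relation (3.9), we conclude that

$$G(\lambda) - \mathbb{E}[X] \le \tfrac{1}{2}\sigma^2\lambda,$$

which is equivalent to the claim (3.6). We leave the extension of this proof to the case $I = \mathbb{R}$
as an exercise for the reader.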
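As a concrete instance of the Herbst argument, consider a Rademacher variable $X \in \{-1, +1\}$, for which $\varphi_X(\lambda) = \cosh\lambda$ and $\mathbb{E}[X] = 0$. The following sketch checks numerically, on a grid of $\lambda$ values, that both the entropy hypothesis (3.5) and the conclusion (3.6) hold with $\sigma = 1$. The choice of example, the grid and the variable names are assumptions made for illustration.

```python
import numpy as np

lams = np.linspace(0.01, 10.0, 1000)

mgf = np.cosh(lams)          # E[e^{lam X}] for X uniform on {-1, +1}
mgf_prime = np.sinh(lams)    # derivative of the moment generating function

# Entropy H(e^{lam X}) computed from the representation (3.3).
entropy = lams * mgf_prime - mgf * np.log(mgf)

sigma = 1.0
# Hypothesis (3.5): H(e^{lam X}) <= (1/2) sigma^2 lam^2 E[e^{lam X}].
hypothesis_ok = bool(np.all(entropy <= 0.5 * sigma**2 * lams**2 * mgf + 1e-12))
# Conclusion (3.6): log E[e^{lam (X - E[X])}] <= (1/2) lam^2 sigma^2.
conclusion_ok = bool(np.all(np.log(mgf) <= 0.5 * sigma**2 * lams**2 + 1e-12))
```

Here the hypothesis reduces to $\lambda\tanh\lambda - \log\cosh\lambda \le \lambda^2/2$, and the conclusion to the familiar bound $\log\cosh\lambda \le \lambda^2/2$.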


Thus far, we have seen how a particular upper bound (3.5) on the entropy $\mathbb{H}(e^{\lambda X})$ translates
into a bound on the cumulant generating function (3.6), and hence into sub-Gaussian tail
bounds via the usual Chernoff argument. It is natural to explore to what extent this approach
may be generalized. As seen previously in Chapter 2, a broader class of random variables
are those with sub-exponential tails, and the following result is the analog of Proposition 3.2
in this case.

Proposition 3.3 (Bernstein entropy bound) Suppose that there are positive constants
$b$ and $\sigma$ such that the entropy $\mathbb{H}(e^{\lambda X})$ satisfies the bound

$$\mathbb{H}(e^{\lambda X}) \le \lambda^2\{b\varphi_X'(\lambda) + \varphi_X(\lambda)(\sigma^2 - b\,\mathbb{E}[X])\} \quad \text{for all } \lambda \in [0, 1/b). \tag{3.10}$$

Then $X$ satisfies the bound

$$\log \mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}] \le \sigma^2\lambda^2(1 - b\lambda)^{-1} \quad \text{for all } \lambda \in [0, 1/b). \tag{3.11}$$

Remarks: As a consequence of the usual Chernoff argument, Proposition 3.3 implies that
$X$ satisfies the upper tail bound

$$\mathbb{P}[X \ge \mathbb{E}[X] + \delta] \le \exp\Big(-\frac{\delta^2}{4\sigma^2 + 2b\delta}\Big) \quad \text{for all } \delta > 0, \tag{3.12}$$

which (modulo non-optimal constants) is the usual Bernstein-type bound to be expected for
a variable with sub-exponential tails. See Proposition 2.10 from Chapter 2 for further details
on such Bernstein bounds.
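The Chernoff step leading from (3.11) to (3.12) can be spelled out as follows. This is a sketch of one way to do the calculation: the particular value $\lambda^\ast$ below is a convenient choice made here for illustration (it need not be the exact minimizer), and it is not taken from the original text.

```latex
\begin{align*}
\mathbb{P}\big[X \ge \mathbb{E}[X] + \delta\big]
  &\le \exp\Big(-\lambda\delta + \frac{\sigma^2\lambda^2}{1 - b\lambda}\Big)
  \qquad \text{for any } \lambda \in [0, 1/b)
  \quad \text{(Markov's inequality and (3.11))}, \\
\lambda^\ast := \frac{\delta}{2\sigma^2 + b\delta}
  &\quad\Longrightarrow\quad
  1 - b\lambda^\ast = \frac{2\sigma^2}{2\sigma^2 + b\delta} > 0, \\
-\lambda^\ast\delta + \frac{\sigma^2(\lambda^\ast)^2}{1 - b\lambda^\ast}
  &= -\frac{\delta^2}{2\sigma^2 + b\delta} + \frac{\delta^2}{2(2\sigma^2 + b\delta)}
   = -\frac{\delta^2}{4\sigma^2 + 2b\delta}.
\end{align*}
```

Since $\lambda^\ast < 1/b$, this choice is admissible, and substituting it into the first line gives the exponent appearing in (3.12).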


We now turn to the proof of Proposition 3.3. 


Proof As before, we omit the dependence of $\varphi_X$ on $X$ throughout this proof so as to simplify
notation. By rescaling and recentering arguments sketched out in Exercise 3.6, we may assume
without loss of generality that $\mathbb{E}[X] = 0$ and $b = 1$, in which case the inequality (3.10)
simplifies to

$$\mathbb{H}(e^{\lambda X}) \le \lambda^2\{\varphi'(\lambda) + \varphi(\lambda)\sigma^2\} \quad \text{for all } \lambda \in [0, 1). \tag{3.13}$$

Recalling the function $G(\lambda) = \frac{1}{\lambda} \log \varphi(\lambda)$ from the proof of Proposition 3.2, a little bit of
algebra shows that condition (3.13) is equivalent to the differential inequality $G' \le \sigma^2 + \frac{\varphi'}{\varphi}$.
Letting $\lambda_0 > 0$ be arbitrary and integrating both sides of this inequality over the interval
$(\lambda_0, \lambda)$, we obtain

$$G(\lambda) - G(\lambda_0) \le \sigma^2(\lambda - \lambda_0) + \log \varphi(\lambda) - \log \varphi(\lambda_0).$$

Since this inequality holds for all $\lambda_0 > 0$, we may take the limit as $\lambda_0 \to 0^+$. Doing so and
using the facts that $\lim_{\lambda_0 \to 0^+} G(\lambda_0) = G(0) = \mathbb{E}[X]$ and $\log \varphi(0) = 0$, we obtain the bound

$$G(\lambda) - \mathbb{E}[X] \le \sigma^2\lambda + \log \varphi(\lambda). \tag{3.14}$$

Substituting the definition of $G$ and rearranging yields the claim (3.11).


3.1.3 Separately convex functions and the entropic method 


Thus far, we have seen how the entropic method can be used to derive sub-Gaussian and 
sub-exponential tail bounds for scalar random variables. If this were the only use of the 
entropic method, then we would have gained little beyond what can be done via the usual 
Chernoff bound. The real power of the entropic method—as we now will see—manifests 
itself in dealing with concentration for functions of many random variables. 

As an illustration, we begin by stating a deep result that can be proven in a relatively
direct manner using the entropy method. We say that a function $f: \mathbb{R}^n \to \mathbb{R}$ is separately
convex if, for each index $k \in \{1, 2, \ldots, n\}$, the univariate function

$$y_k \mapsto f(x_1, x_2, \ldots, x_{k-1}, y_k, x_{k+1}, \ldots, x_n)$$

is convex for each fixed vector $(x_1, x_2, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n) \in \mathbb{R}^{n-1}$. A function $f$ is
$L$-Lipschitz with respect to the Euclidean norm if

$$|f(x) - f(x')| \le L\|x - x'\|_2 \quad \text{for all } x, x' \in \mathbb{R}^n. \tag{3.15}$$

The following result applies to separately convex and $L$-Lipschitz functions.

Theorem 3.4 Let $\{X_i\}_{i=1}^n$ be independent random variables, each supported on the
interval $[a, b]$, and let $f: \mathbb{R}^n \to \mathbb{R}$ be separately convex, and $L$-Lipschitz with respect
to the Euclidean norm. Then, for all $\delta > 0$, we have

$$\mathbb{P}[f(X) \ge \mathbb{E}[f(X)] + \delta] \le \exp\Big(-\frac{\delta^2}{4L^2(b - a)^2}\Big). \tag{3.16}$$


Remarks: This result is the analog of the upper tail bound for Lipschitz functions of Gaus- 
sian variables (cf. Theorem 2.26 in Chapter 2), but applicable to independent and bounded 
variables instead. In contrast to the Gaussian case, the additional assumption of separate 
convexity cannot be eliminated in general; see the bibliographic section for further discussion.
When f is jointly convex, other techniques can be used to obtain the lower tail bound
as well; see Theorem 3.24 in the sequel for one such example.


Theorem 3.4 can be used to obtain order-optimal bounds for a number of interesting prob- 
lems. As one illustration, we return to the Rademacher complexity, first introduced in Ex- 
ample 2.25 of Chapter 2. 


Example 3.5 (Sharp bounds on Rademacher complexity) Given a bounded subset $A \subset \mathbb{R}^n$,
consider the random variable $Z = \sup_{a \in A} \sum_{k=1}^n a_k\varepsilon_k$, where $\varepsilon_k \in \{-1, +1\}$ are i.i.d.
Rademacher variables. Let us view $Z$ as a function of the random signs, and use Theorem 3.4 to
bound the probability of the tail event $\{Z \ge \mathbb{E}[Z] + t\}$.

It suffices to verify the convexity and Lipschitz conditions of the theorem. First, since
$Z = Z(\varepsilon_1, \ldots, \varepsilon_n)$ is the maximum of a collection of linear functions, it is jointly (and hence
separately) convex. Let $Z' = Z(\varepsilon_1', \ldots, \varepsilon_n')$ where $\varepsilon' \in \{-1, +1\}^n$ is a second vector of sign
variables. For any $a \in A$, we have

$$\langle a, \varepsilon\rangle - Z' = \langle a, \varepsilon\rangle - \sup_{a' \in A} \langle a', \varepsilon'\rangle \le \langle a, \varepsilon - \varepsilon'\rangle \le \|a\|_2\,\|\varepsilon - \varepsilon'\|_2,$$

where $\langle a, \varepsilon\rangle = \sum_{k=1}^n a_k\varepsilon_k$.
Taking suprema over $a \in A$ yields that $Z - Z' \le (\sup_{a \in A} \|a\|_2)\,\|\varepsilon - \varepsilon'\|_2$. Since the same
argument may be applied with the roles of $\varepsilon$ and $\varepsilon'$ reversed, we conclude that $Z$ is Lipschitz
with parameter $W(A) := \sup_{a \in A} \|a\|_2$, corresponding to the Euclidean width of the set.
Putting together the pieces, Theorem 3.4 implies that

$$\mathbb{P}[Z \ge \mathbb{E}[Z] + t] \le \exp\Big(-\frac{t^2}{16\,W^2(A)}\Big). \tag{3.17}$$

Note that the parameter $W^2(A)$ may be substantially smaller than the quantity $\sum_{k=1}^n \sup_{a \in A} a_k^2$;
indeed, it may be smaller by as much as a factor of $n$. In such cases, Theorem 3.4 yields a
much sharper tail bound than our earlier tail bound from Example 2.25, which was obtained
by applying the bounded differences inequality. ♣
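For a small finite set $A$, the quantities in this example can be computed exactly by enumerating all $2^n$ sign vectors. The sketch below does so and compares the exact upper tail of $Z$ with the bound (3.17). The set `A` is a hypothetical example chosen for illustration, and all function names are assumptions of this sketch.

```python
import itertools

import numpy as np

# Hypothetical finite set A in R^6 (each row is one vector a); illustration only.
A = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5, 0.0, 0.0],
    [0.4, 0.4, 0.4, 0.4, 0.4, 0.4],
])
n = A.shape[1]

# All 2^n equiprobable sign vectors, and Z = sup_{a in A} <a, eps> for each of them.
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
Z = (signs @ A.T).max(axis=1)
mean_Z = Z.mean()  # exact expectation, since the sign vectors are equiprobable

width_sq = (A**2).sum(axis=1).max()   # W^2(A) = sup_a ||a||_2^2
coord_sq = (A**2).max(axis=0).sum()   # sum_k sup_a a_k^2 (bounded differences scale)


def exact_tail(t):
    """Exact probability P[Z >= E[Z] + t]."""
    return float(np.mean(Z >= mean_Z + t))


def theorem_bound(t):
    """Right-hand side of (3.17)."""
    return float(np.exp(-t**2 / (16.0 * width_sq)))
```

For this particular set, `width_sq` equals 1 while `coord_sq` equals 2.82, which illustrates the gap between the two scale parameters discussed above.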


Another use of Theorem 3.4 is in random matrix theory. 


Example 3.6 (Operator norm of a random matrix) Let $X \in \mathbb{R}^{n \times d}$ be a random matrix,
say with $X_{ij}$ drawn i.i.d. from some zero-mean distribution supported on the interval
$[-1, +1]$. The spectral or $\ell_2$-operator norm of $X$, denoted by $|||X|||_2$, is its maximum singular
value, given by

$$|||X|||_2 = \max_{\substack{v \in \mathbb{R}^d \\ \|v\|_2 = 1}} \|Xv\|_2 = \max_{\substack{v \in \mathbb{R}^d \\ \|v\|_2 = 1}} \; \max_{\substack{u \in \mathbb{R}^n \\ \|u\|_2 = 1}} u^T X v. \tag{3.18}$$

Let us view the mapping $X \mapsto |||X|||_2$ as a function $f$ from $\mathbb{R}^{nd}$ to $\mathbb{R}$. In order to apply
Theorem 3.4, we need to show that $f$ is both Lipschitz and convex. From its definition (3.18),
the operator norm is the supremum of a collection of functions that are linear in the entries of
$X$; any such supremum is a convex function. Moreover, we have

$$\big|\, |||X|||_2 - |||X'|||_2 \,\big| \overset{(i)}{\le} |||X - X'|||_2 \overset{(ii)}{\le} |||X - X'|||_F, \tag{3.19}$$

where step (i) follows from the triangle inequality, and step (ii) follows since the Frobenius
norm of a matrix always upper bounds the operator norm. (The Frobenius norm $|||M|||_F$ of
a matrix $M \in \mathbb{R}^{n \times d}$ is simply the Euclidean norm of all its entries; see equation (2.50).)
Consequently, the operator norm is Lipschitz with parameter $L = 1$, and thus Theorem 3.4
implies that

$$\mathbb{P}\big[|||X|||_2 \ge \mathbb{E}[|||X|||_2] + \delta\big] \le e^{-\frac{\delta^2}{16}}.$$

It is worth observing that this bound is the analog of our earlier bound (2.52) on the operator
norm of a Gaussian random matrix, albeit with a worse constant. See Example 2.32 in
Chapter 2 for further details on this Gaussian case. ♣
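The Lipschitz chain (3.19) is easy to probe numerically. The following sketch draws pairs of random matrices with entries in $[-1, +1]$ and checks both inequalities, using `numpy.linalg.norm` with `ord=2` for the operator norm and `"fro"` for the Frobenius norm. The dimensions, seed and number of trials are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 20


def op_norm(M):
    """Spectral norm, i.e. the largest singular value of M."""
    return np.linalg.norm(M, 2)


chain_holds = True
for _ in range(200):
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    X_prime = rng.uniform(-1.0, 1.0, size=(n, d))
    lhs = abs(op_norm(X) - op_norm(X_prime))        # | |||X|||_2 - |||X'|||_2 |
    mid = op_norm(X - X_prime)                      # |||X - X'|||_2
    rhs = np.linalg.norm(X - X_prime, "fro")        # |||X - X'|||_F
    # Step (i) is the triangle inequality; step (ii) is operator <= Frobenius.
    chain_holds = chain_holds and (lhs <= mid + 1e-10) and (mid <= rhs + 1e-10)
```

A check of this kind does not prove the inequalities, but it is a quick sanity test of the claim that the map $X \mapsto |||X|||_2$ is 1-Lipschitz in the Frobenius norm.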


3.1.4 Tensorization and separately convex functions 


We now return to prove Theorem 3.4. The proof is based on two lemmas, both of which are 
of independent interest. Here we state these results and discuss some of their consequences, 
deferring their proofs to the end of this section. Our first lemma establishes an entropy bound 
for univariate functions: 


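Lemma 3.7 (Entropy bound for univariate functions) Let $X, Y \sim \mathbb{P}$ be a pair of
i.i.d. variates. Then for any function $g: \mathbb{R} \to \mathbb{R}$, we have

$$\mathbb{H}(e^{\lambda g(X)}) \le \lambda^2\,\mathbb{E}\big[(g(X) - g(Y))^2 e^{\lambda g(X)}\,\mathbb{I}[g(X) \ge g(Y)]\big] \quad \text{for all } \lambda > 0. \tag{3.20a}$$

If in addition $X$ is supported on $[a, b]$, and $g$ is convex and Lipschitz, then

$$\mathbb{H}(e^{\lambda g(X)}) \le \lambda^2(b - a)^2\,\mathbb{E}\big[(g'(X))^2 e^{\lambda g(X)}\big] \quad \text{for all } \lambda > 0, \tag{3.20b}$$

where $g'$ is the derivative.

In stating this lemma, we have used the fact that any convex and Lipschitz function has a
derivative defined almost everywhere, a result known as Rademacher's theorem. Moreover,
note that if $g$ is Lipschitz with parameter $L$, then we are guaranteed that $\|g'\|_\infty \le L$, so that
inequality (3.20b) implies an entropy bound of the form

$$\mathbb{H}(e^{\lambda g(X)}) \le \lambda^2 L^2(b - a)^2\,\mathbb{E}[e^{\lambda g(X)}] \quad \text{for all } \lambda > 0.$$

In turn, by an application of Proposition 3.2, such an entropy inequality implies the upper
tail bound

$$\mathbb{P}[g(X) \ge \mathbb{E}[g(X)] + \delta] \le e^{-\frac{\delta^2}{4L^2(b - a)^2}}.$$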


Thus, Lemma 3.7 implies the univariate version of Theorem 3.4. However, the inequal- 
ity (3.20b) is sharper, in that it involves g’(X) as opposed to the worst-case bound L, and this 
distinction will be important in deriving the sharp result of Theorem 3.4. The more general 
inequality (3.20a) will be useful in deriving functional versions of the Hoeffding and Bernstein
inequalities (see Section 3.4).
Returning to the main thread, it remains to extend this univariate result to the multivariate
setting, and the so-called tensorization property of entropy plays a key role here. Given
a function $f: \mathbb{R}^n \to \mathbb{R}$, an index $k \in \{1, 2, \ldots, n\}$ and a vector $x_{\setminus k} = (x_i, i \ne k) \in \mathbb{R}^{n-1}$, we
define the conditional entropy in coordinate $k$ via

$$\mathbb{H}(e^{\lambda f_k(X_k)} \mid x_{\setminus k}) := \mathbb{H}(e^{\lambda f(x_1, \ldots, x_{k-1}, X_k, x_{k+1}, \ldots, x_n)}),$$

where $f_k: \mathbb{R} \to \mathbb{R}$ is the coordinate function $x_k \mapsto f(x_1, \ldots, x_k, \ldots, x_n)$. (To be clear, for a
random vector $X^{\setminus k} \in \mathbb{R}^{n-1}$, the entropy $\mathbb{H}(e^{\lambda f_k(X_k)} \mid X^{\setminus k})$ is a random variable, and its expectation
is often referred to as the conditional entropy.) The following result shows that the joint
entropy can be upper bounded by a sum of univariate entropies, suitably defined.

Lemma 3.8 (Tensorization of entropy) Let $f: \mathbb{R}^n \to \mathbb{R}$, and let $\{X_k\}_{k=1}^n$ be independent
random variables. Then

$$\mathbb{H}(e^{\lambda f(X_1, \ldots, X_n)}) \le \mathbb{E}\Big[\sum_{k=1}^n \mathbb{H}(e^{\lambda f_k(X_k)} \mid X^{\setminus k})\Big] \quad \text{for all } \lambda > 0. \tag{3.21}$$


Equipped with these two results, we are now ready to prove Theorem 3.4. 


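Proof of Theorem 3.4 For any $k \in \{1, 2, \ldots, n\}$ and fixed vector $x_{\setminus k} \in \mathbb{R}^{n-1}$, our assumptions
imply that the coordinate function $f_k$ is convex, and hence Lemma 3.7 implies that, for all
$\lambda > 0$, we have

$$\begin{aligned}
\mathbb{H}(e^{\lambda f_k(X_k)} \mid x_{\setminus k}) &\le \lambda^2(b - a)^2\,\mathbb{E}_{X_k}\big[(f_k'(X_k))^2 e^{\lambda f_k(X_k)} \mid x_{\setminus k}\big] \\
&= \lambda^2(b - a)^2\,\mathbb{E}_{X_k}\Big[\Big(\frac{\partial f(x_1, \ldots, X_k, \ldots, x_n)}{\partial x_k}\Big)^2 e^{\lambda f(x_1, \ldots, X_k, \ldots, x_n)}\Big],
\end{aligned}$$

where the second line involves unpacking the definition of the conditional entropy.
Combined with Lemma 3.8, we find that the unconditional entropy is upper bounded as

$$\mathbb{H}(e^{\lambda f(X)}) \le \lambda^2(b - a)^2\,\mathbb{E}\Big[\sum_{k=1}^n \Big(\frac{\partial f(X)}{\partial x_k}\Big)^2 e^{\lambda f(X)}\Big] \overset{(i)}{\le} \lambda^2(b - a)^2 L^2\,\mathbb{E}[e^{\lambda f(X)}].$$

Here step (i) follows from the Lipschitz condition, which guarantees that

$$\|\nabla f(x)\|_2^2 = \sum_{k=1}^n \Big(\frac{\partial f(x)}{\partial x_k}\Big)^2 \le L^2$$

almost surely. Thus, the tail bound (3.16) follows from an application of Proposition 3.2.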


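It remains to prove the two auxiliary lemmas used in the preceding proof, namely Lemma
3.7 on entropy bounds for univariate Lipschitz functions, and Lemma 3.8 on the tensorization
of entropy. We begin with the former property.

Proof of Lemma 3.7
By the definition of entropy, we can write

$$\begin{aligned}
\mathbb{H}(e^{\lambda g(X)}) &= \mathbb{E}_X[\lambda g(X) e^{\lambda g(X)}] - \mathbb{E}_X[e^{\lambda g(X)}] \log\big(\mathbb{E}_Y[e^{\lambda g(Y)}]\big) \\
&\overset{(i)}{\le} \mathbb{E}_X[\lambda g(X) e^{\lambda g(X)}] - \mathbb{E}_{X,Y}[e^{\lambda g(X)} \lambda g(Y)] \\
&= \tfrac{1}{2}\,\mathbb{E}_{X,Y}\big[\lambda\{g(X) - g(Y)\}\{e^{\lambda g(X)} - e^{\lambda g(Y)}\}\big] \\
&\overset{(ii)}{=} \lambda\,\mathbb{E}\big[(g(X) - g(Y))(e^{\lambda g(X)} - e^{\lambda g(Y)})\,\mathbb{I}[g(X) \ge g(Y)]\big],
\end{aligned} \tag{3.22}$$

where step (i) follows from Jensen's inequality, and step (ii) follows from symmetry of $X$
and $Y$.
By convexity of the exponential, we have $e^s - e^t \le e^s(s - t)$ for all $s, t \in \mathbb{R}$. For $s \ge t$, we
can multiply both sides by $(s - t) \ge 0$, thereby obtaining

$$(s - t)(e^s - e^t)\,\mathbb{I}[s \ge t] \le (s - t)^2 e^s\,\mathbb{I}[s \ge t].$$

Applying this bound with $s = \lambda g(X)$ and $t = \lambda g(Y)$ to the inequality (3.22) yields

$$\mathbb{H}(e^{\lambda g(X)}) \le \lambda^2\,\mathbb{E}\big[(g(X) - g(Y))^2 e^{\lambda g(X)}\,\mathbb{I}[g(X) \ge g(Y)]\big], \tag{3.23}$$

where we have recalled the assumption that $\lambda > 0$.
If in addition $g$ is convex, then we have the upper bound $g(x) - g(y) \le g'(x)(x - y)$, and
hence, for $g(x) \ge g(y)$,

$$(g(x) - g(y))^2 \le (g'(x))^2(x - y)^2 \le (g'(x))^2(b - a)^2,$$

where the final step uses the assumption that $x, y \in [a, b]$. Combining the pieces yields the
claim.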


We now turn to the tensorization property of entropy. 


Proof of Lemma 3.8 


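The proof makes use of the following variational representation for entropy:

$$\mathbb{H}(e^{\lambda f(X)}) = \sup_g \big\{\mathbb{E}[g(X) e^{\lambda f(X)}] \mid \mathbb{E}[e^{g(X)}] \le 1\big\}. \tag{3.24}$$

This equivalence follows by a duality argument that we explore in Exercise 3.9.
For each $j \in \{1, 2, \ldots, n\}$, define $X_j^n = (X_j, \ldots, X_n)$. Let $g$ be any function that satisfies
$\mathbb{E}[e^{g(X)}] \le 1$. We can then define an auxiliary sequence of functions $\{g^1, \ldots, g^n\}$ via

$$g^1(X_1, \ldots, X_n) := g(X) - \log \mathbb{E}[e^{g(X)} \mid X_2^n]$$

and

$$g^k(X_k, \ldots, X_n) := \log \frac{\mathbb{E}[e^{g(X)} \mid X_k^n]}{\mathbb{E}[e^{g(X)} \mid X_{k+1}^n]} \quad \text{for } k = 2, \ldots, n.$$

By construction, we have

$$\sum_{k=1}^n g^k(X_k, \ldots, X_n) = g(X) - \log \mathbb{E}[e^{g(X)}] \ge g(X) \tag{3.25}$$

and moreover $\mathbb{E}[\exp(g^k(X_k, X_{k+1}, \ldots, X_n)) \mid X_{k+1}^n] = 1$.
We now use this decomposition within the variational representation (3.24), thereby
obtaining the chain of upper bounds

$$\begin{aligned}
\mathbb{E}[g(X) e^{\lambda f(X)}] &\overset{(i)}{\le} \sum_{k=1}^n \mathbb{E}\big[g^k(X_k, \ldots, X_n) e^{\lambda f(X)}\big] \\
&= \sum_{k=1}^n \mathbb{E}_{X^{\setminus k}}\Big[\mathbb{E}_{X_k}\big[g^k(X_k, \ldots, X_n) e^{\lambda f(X)} \mid X^{\setminus k}\big]\Big] \\
&\overset{(ii)}{\le} \sum_{k=1}^n \mathbb{E}_{X^{\setminus k}}\big[\mathbb{H}(e^{\lambda f_k(X_k)} \mid X^{\setminus k})\big],
\end{aligned}$$

where inequality (i) uses the bound (3.25), and inequality (ii) applies the variational
representation (3.24) to the univariate functions, and also makes use of the fact that
$\mathbb{E}[\exp(g^k(X_k, \ldots, X_n)) \mid X^{\setminus k}] = 1$. Since this argument applies to any function $g$ such that
$\mathbb{E}[e^{g(X)}] \le 1$, we may take the supremum over the left-hand side, and combined with the
variational representation (3.24), we conclude that

$$\mathbb{H}(e^{\lambda f(X)}) \le \sum_{k=1}^n \mathbb{E}_{X^{\setminus k}}\big[\mathbb{H}(e^{\lambda f_k(X_k)} \mid X^{\setminus k})\big],$$

as claimed.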
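The tensorization inequality (3.21) can be verified exactly on a small product space. The sketch below uses three independent binary coordinates and an arbitrary function $f$, computing the joint entropy and the sum of expected conditional entropies by enumeration. The probabilities `q`, the function `f` and the value of `lam` are hypothetical choices made for illustration.

```python
import itertools
import math


def entropy(values, probs):
    """H(Z) = E[Z log Z] - E[Z] log E[Z] for a positive discrete variable Z."""
    mean = sum(p * z for z, p in zip(values, probs))
    return sum(p * z * math.log(z) for z, p in zip(values, probs)) - mean * math.log(mean)


# Hypothetical setup: X_k = 1 with probability q[k], else 0, independently.
q = [0.3, 0.5, 0.8]
n = len(q)
lam = 1.5


def f(x):
    # An arbitrary function of the three coordinates (illustration only).
    return x[0] * x[1] - 2.0 * x[2] + math.sin(x[0] + x[1] + x[2])


def prob(x, skip=None):
    """Probability of the coordinates of x, optionally leaving out index `skip`."""
    return math.prod(q[j] if x[j] == 1 else 1.0 - q[j] for j in range(n) if j != skip)


points = list(itertools.product([0, 1], repeat=n))

# Left-hand side of (3.21): entropy of exp(lam * f(X)) under the joint law.
joint = entropy([math.exp(lam * f(x)) for x in points], [prob(x) for x in points])

# Right-hand side of (3.21): expected conditional entropy, summed over coordinates.
tensorized = 0.0
for k in range(n):
    for rest in itertools.product([0, 1], repeat=n - 1):
        x0 = rest[:k] + (0,) + rest[k:]   # coordinate k set to 0, others frozen
        x1 = rest[:k] + (1,) + rest[k:]   # coordinate k set to 1, others frozen
        cond = entropy(
            [math.exp(lam * f(x0)), math.exp(lam * f(x1))],
            [1.0 - q[k], q[k]],
        )
        tensorized += prob(x0, skip=k) * cond
```

Note that no convexity or Lipschitz assumption on $f$ is needed here, in line with the statement of Lemma 3.8.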


3.2 A geometric perspective on concentration 


We now turn to some geometric aspects of the concentration of measure. Historically, this 
geometric viewpoint is among the oldest, dating back to the classical result of Lévy on 
concentration of measure for Lipschitz functions of Gaussians. It also establishes deep links 
between probabilistic concepts and high-dimensional geometry. 

The results of this section are most conveniently stated in terms of a metric measure
space, namely a metric space $(\mathcal{X}, \rho)$ endowed with a probability measure $\mathbb{P}$ on its Borel
sets. Some canonical examples of metric spaces for the reader to keep in mind are the set
$\mathcal{X} = \mathbb{R}^n$ equipped with the usual Euclidean metric $\rho(x, y) := \|x - y\|_2$, and the discrete cube
$\mathcal{X} = \{0, 1\}^n$ equipped with the Hamming metric $\rho(x, y) = \sum_{j=1}^n \mathbb{I}[x_j \ne y_j]$.

Associated with any metric measure space is an object known as its concentration function,
which is defined in a geometric manner via the $\epsilon$-enlargements of sets. The concentration
function specifies how rapidly, as a function of $\epsilon$, the probability of any $\epsilon$-enlargement
increases towards one. As we will see, this function is intimately related to the concentration
properties of Lipschitz functions on the metric space.


3.2.1 Concentration functions 


Given a set $A \subseteq \mathcal{X}$ and a point $x \in \mathcal{X}$, define the quantity

$$\rho(x, A) := \inf_{y \in A} \rho(x, y), \tag{3.26}$$

which measures the distance between the point $x$ and the closest point in the set $A$. Given a
parameter $\epsilon > 0$, the $\epsilon$-enlargement of $A$ is given by

$$A^\epsilon := \{x \in \mathcal{X} \mid \rho(x, A) < \epsilon\}. \tag{3.27}$$

In words, the set $A^\epsilon$ corresponds to the open neighborhood of points lying at distance less
than $\epsilon$ from $A$. With this notation, the concentration function of the metric measure space
$(\mathcal{X}, \rho, \mathbb{P})$ is defined as follows:

Definition 3.9 The concentration function $\alpha: [0, \infty) \to \mathbb{R}_+$ associated with metric
measure space $(\mathbb{P}, \mathcal{X}, \rho)$ is given by

$$\alpha_{\mathbb{P},(\mathcal{X},\rho)}(\epsilon) := \sup_{A \subseteq \mathcal{X}} \big\{1 - \mathbb{P}[A^\epsilon] \mid \mathbb{P}[A] \ge \tfrac{1}{2}\big\}, \tag{3.28}$$

where the supremum is taken over all measurable subsets $A$.
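On a very small space, the supremum in definition (3.28) can be evaluated by brute force. The following sketch does this for the discrete cube $\{0, 1\}^3$ with the uniform measure and the (unnormalized) Hamming metric, enumerating every subset $A$ with $\mathbb{P}[A] \ge 1/2$. The choice $n = 3$ and the function names are assumptions made for illustration; the enumeration grows doubly exponentially in $n$ and is not meant as a practical method.

```python
import itertools

n = 3
points = list(itertools.product([0, 1], repeat=n))


def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def concentration(eps):
    """Brute-force value of (3.28) under the uniform measure on {0, 1}^n."""
    best = 0.0
    for mask in range(1, 2 ** len(points)):
        A = [p for i, p in enumerate(points) if (mask >> i) & 1]
        if 2 * len(A) < len(points):
            continue  # only sets with P[A] >= 1/2 are admissible
        # eps-enlargement (3.27): points at distance strictly less than eps from A.
        enlarged = [x for x in points if min(hamming(x, y) for y in A) < eps]
        best = max(best, 1.0 - len(enlarged) / len(points))
    return best


alpha_values = [concentration(eps) for eps in (1, 2, 3)]
```

In this example the supremum at $\epsilon = 2$ is attained by a Hamming ball of radius one, whose enlargement misses only the antipodal point of its center.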


When the underlying metric space $(\mathcal{X}, \rho)$ is clear from the context, we frequently use the
abbreviated notation $\alpha_{\mathbb{P}}$. It follows immediately from the definition (3.28) that $\alpha_{\mathbb{P}}(\epsilon) \in [0, 1]$
for all $\epsilon \ge 0$. Of primary interest is the behavior of the concentration function as $\epsilon$ increases,
and, more precisely, how rapidly it approaches zero. Let us consider some examples to
illustrate.


Example 3.10 (Concentration function for sphere) Consider the metric measure space
defined by the uniform distribution over the $n$-dimensional Euclidean sphere

$$\mathbb{S}^{n-1} := \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}, \tag{3.29}$$

equipped with the geodesic distance $\rho(x, y) := \arccos \langle x, y\rangle$. Let us upper bound the
concentration function $\alpha_{\mathbb{S}^{n-1}}$ defined by the triplet $(\mathbb{P}, \mathbb{S}^{n-1}, \rho)$, where $\mathbb{P}$ is the uniform distribution
over the sphere. For each $y \in \mathbb{S}^{n-1}$, we can define the hemisphere

$$H_y := \{x \in \mathbb{S}^{n-1} \mid \rho(x, y) \ge \pi/2\} = \{x \in \mathbb{S}^{n-1} \mid \langle x, y\rangle \le 0\}, \tag{3.30}$$

as illustrated in Figure 3.1(a). With some simple geometry, it can be shown that its
$\epsilon$-enlargement corresponds to the set

$$H_y^\epsilon = \{z \in \mathbb{S}^{n-1} \mid \langle z, y\rangle < \sin(\epsilon)\}, \tag{3.31}$$

as illustrated in Figure 3.1(b). Note that $\mathbb{P}[H_y] = 1/2$, so that the hemisphere (3.30) is a
candidate set for the supremum defining the concentration function (3.28). The classical
isoperimetric theorem of Lévy asserts that these hemispheres are extremal, meaning that
they achieve the supremum, viz.

$$\alpha_{\mathbb{S}^{n-1}}(\epsilon) = 1 - \mathbb{P}[H_y^\epsilon]. \tag{3.32}$$


Figure 3.1 (a) Idealized illustration of the sphere $\mathbb{S}^{n-1}$. Any vector $y \in \mathbb{S}^{n-1}$ defines
a hemisphere $H_y = \{x \in \mathbb{S}^{n-1} \mid \langle x, y\rangle \le 0\}$, corresponding to those vectors whose
angle $\theta = \arccos \langle x, y\rangle$ with $y$ is at least $\pi/2$ radians. (b) The $\epsilon$-enlargement of the
hemisphere $H_y$. (c) A central slice $T_y(\epsilon)$ of the sphere of width $\epsilon$.

Let us take this fact as given, and use it to compute an upper bound on the concentration
function. In order to do so, we need to lower bound the probability $\mathbb{P}[H_y^\epsilon]$. Since $\sin(\epsilon) \ge \epsilon/2$
for all $\epsilon \in (0, \pi/2]$, the enlargement contains the set

$$\tilde{H}_y^\epsilon := \{z \in \mathbb{S}^{n-1} \mid \langle z, y\rangle \le \tfrac{1}{2}\epsilon\},$$

and hence $\mathbb{P}[H_y^\epsilon] \ge \mathbb{P}[\tilde{H}_y^\epsilon]$. Finally, a geometric calculation, left as an exercise for the reader,
yields that, for all $\epsilon \in (0, 2)$, we have

$$\mathbb{P}[\tilde{H}_y^\epsilon] \ge 1 - \Big(1 - \Big(\frac{\epsilon}{2}\Big)^2\Big)^{n/2} \ge 1 - e^{-n\epsilon^2/8}, \tag{3.33}$$

where we have used the inequality $(1 - t) \le e^{-t}$ with $t = \epsilon^2/4$. We thus obtain that the
concentration function is upper bounded as $\alpha_{\mathbb{S}^{n-1}}(\epsilon) \le e^{-n\epsilon^2/8}$. A similar but more careful
approach to bounding $\mathbb{P}[H_y^\epsilon]$ can be used to establish the sharper upper bound


$$\alpha_{\mathbb{S}^{n-1}}(\epsilon) \le \sqrt{2\pi}\, e^{-\frac{n\epsilon^2}{2}}. \tag{3.34}$$
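The weaker bound (3.33) is easy to compare against simulation. The sketch below draws points uniformly from the sphere by normalizing standard Gaussian vectors, takes $y$ to be the first coordinate vector, and estimates $\mathbb{P}[\langle z, y\rangle \le \epsilon/2]$. The dimension, the value of $\epsilon$, the sample size and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, num_samples = 50, 0.6, 20_000

# Normalizing a standard Gaussian vector gives a uniform point on the sphere S^{n-1}.
g = rng.standard_normal((num_samples, n))
z = g / np.linalg.norm(g, axis=1, keepdims=True)

inner = z[:, 0]  # <z, y> with y equal to the first coordinate vector

empirical = float(np.mean(inner <= eps / 2))            # estimate of P[<z, y> <= eps/2]
lower_bound = 1.0 - float(np.exp(-n * eps**2 / 8.0))    # right-hand side of (3.33)
```

With these values the lower bound is roughly 0.89, while the empirical frequency is typically much closer to one, which reflects the slack in the elementary argument.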


The bound (3.34) is an extraordinary conclusion, originally due to Lévy, and it is worth
pausing to think about it in more depth. Among other consequences, it implies that, if we
consider a central slice of the sphere of width $\epsilon$, say a set of the form

$$T_y(\epsilon) := \{z \in \mathbb{S}^{n-1} \mid |\langle z, y\rangle| \le \epsilon/2\}, \tag{3.35}$$

as illustrated in Figure 3.1(c), then it occupies a huge fraction of the total volume: in
particular, we have $\mathbb{P}[T_y(\epsilon)] \ge 1 - 2\sqrt{2\pi} \exp\big(-\frac{n\epsilon^2}{8}\big)$, since each of the two caps outside the
slice is the complement of a hemisphere enlargement $H^{\epsilon'}$ with $\sin(\epsilon') = \epsilon/2$, and hence $\epsilon' \ge \epsilon/2$.
Moreover, this conclusion holds for any
such slice. To be clear, the two-dimensional instance shown in Figure 3.1(c), like any
low-dimensional example, fails to capture the behavior of high-dimensional spheres. In general,
our low-dimensional intuition can be very misleading when applied to high-dimensional
settings. ♣

3.2.2 Connection to Lipschitz functions


In Chapter 2 and the preceding section of this chapter, we explored some methods for
obtaining deviation and concentration inequalities for various types of Lipschitz functions. The
concentration function $\alpha_{\mathbb{P},(\mathcal{X},\rho)}$ turns out to be intimately related to such results on the tail
behavior of Lipschitz functions. In particular, suppose that a function $f: \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz
with respect to the metric $\rho$, that is,

$$|f(x) - f(y)| \le L\rho(x, y) \quad \text{for all } x, y \in \mathcal{X}. \tag{3.36}$$

Given a random variable $X \sim \mathbb{P}$, let $m_f$ be any median of $f(X)$, meaning a number such that

$$\mathbb{P}[f(X) \ge m_f] \ge 1/2 \quad \text{and} \quad \mathbb{P}[f(X) \le m_f] \ge 1/2. \tag{3.37}$$

Define the set $A = \{x \in \mathcal{X} \mid f(x) \le m_f\}$, and consider its $\frac{\epsilon}{L}$-enlargement $A^{\epsilon/L}$. For any
$x \in A^{\epsilon/L}$, there exists some $y \in A$ such that $\rho(x, y) < \epsilon/L$. Combined with the Lipschitz
property, we conclude that $|f(y) - f(x)| \le L\rho(x, y) < \epsilon$, and hence that

$$A^{\epsilon/L} \subseteq \{x \in \mathcal{X} \mid f(x) < m_f + \epsilon\}. \tag{3.38}$$

Consequently, we have

$$\mathbb{P}[f(X) \ge m_f + \epsilon] \overset{(i)}{\le} 1 - \mathbb{P}[A^{\epsilon/L}] \overset{(ii)}{\le} \alpha_{\mathbb{P}}(\epsilon/L),$$

where inequality (i) follows from the inclusion (3.38), and inequality (ii) uses the fact
$\mathbb{P}[A] \ge 1/2$, and the definition (3.28). Applying a similar argument to $-f$ yields an analogous
left-sided deviation inequality $\mathbb{P}[f(X) \le m_f - \epsilon] \le \alpha_{\mathbb{P}}(\epsilon/L)$, and putting together the
pieces yields the concentration inequality

$$\mathbb{P}[|f(X) - m_f| \ge \epsilon] \le 2\alpha_{\mathbb{P}}(\epsilon/L).$$


As shown in Exercise 2.14 from Chapter 2, such sharp concentration around the median is 
equivalent (up to constant factors) to concentration around the mean. Consequently, we have 
shown that bounds on the concentration function (3.28) imply concentration inequalities for 
any Lipschitz function. This argument can also be reversed, yielding the following equiva- 
lence between control on the concentration function, and the behavior of Lipschitz functions. 


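Proposition 3.11 Given a random variable $X \sim \mathbb{P}$ and concentration function $\alpha_{\mathbb{P}}$, any 1-Lipschitz function on $(\mathcal{X}, \rho)$ satisfies

$$\mathbb{P}[|f(X) - m_f| \geq \epsilon] \leq 2\alpha_{\mathbb{P}}(\epsilon), \tag{3.39a}$$

where $m_f$ is any median of $f$. Conversely, suppose that there is a function $\beta: \mathbb{R}_+ \to \mathbb{R}_+$ such that, for any 1-Lipschitz function on $(\mathcal{X}, \rho)$,

$$\mathbb{P}[f(X) \geq \mathbb{E}[f(X)] + \epsilon] \leq \beta(\epsilon) \quad \text{for all } \epsilon \geq 0. \tag{3.39b}$$

Then the concentration function satisfies the bound $\alpha_{\mathbb{P}}(\epsilon) \leq \beta(\epsilon/2)$.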


Proof It remains to prove the converse claim. Fix some $\epsilon \geq 0$, and let $A$ be an arbitrary measurable set with $\mathbb{P}[A] \geq 1/2$. Recalling the definition of $\rho(x, A)$ from equation (3.26), let us consider the function $f(x) := \min\{\rho(x, A), \epsilon\}$. It can be seen that $f$ is 1-Lipschitz, and moreover that $1 - \mathbb{P}[A^\epsilon] = \mathbb{P}[f(X) \geq \epsilon]$. On the other hand, our construction guarantees that

$$\mathbb{E}[f(X)] \leq (1 - \mathbb{P}[A])\epsilon \leq \epsilon/2,$$

whence we have

$$\mathbb{P}[f(X) \geq \epsilon] \leq \mathbb{P}[f(X) \geq \mathbb{E}[f(X)] + \epsilon/2] \leq \beta(\epsilon/2),$$

where the final inequality uses the assumed condition (3.39b).


Proposition 3.11 has a number of concrete interpretations in specific settings. 


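Example 3.12 (Lévy concentration on $\mathbb{S}^{n-1}$) From our earlier discussion in Example 3.10, the concentration function for the uniform distribution over the sphere $\mathbb{S}^{n-1}$ can be upper bounded as

$$\alpha_{\mathbb{S}^{n-1}}(\epsilon) \leq \sqrt{\tfrac{\pi}{2}}\, e^{-n\epsilon^2/2}.$$

Consequently, for any 1-Lipschitz function $f$ defined on the sphere $\mathbb{S}^{n-1}$, we have the two-sided bound

$$\mathbb{P}[|f(X) - m_f| \geq \epsilon] \leq \sqrt{2\pi}\, e^{-n\epsilon^2/2}, \tag{3.40}$$

where $m_f$ is any median of $f$. Moreover, by the result of Exercise 2.14(d), we also have

$$\mathbb{P}[|f(X) - \mathbb{E}[f(X)]| \geq \epsilon] \leq 2\sqrt{2\pi}\, e^{-n\epsilon^2/8}. \tag{3.41}$$
♣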


Example 3.13 (Concentration for Boolean hypercube) Consider the Boolean hypercube $\mathcal{X} = \{0, 1\}^n$ equipped with the usual Hamming metric

$$\rho_H(x, y) := \sum_{j=1}^n \mathbb{I}[x_j \neq y_j].$$

Given this metric, we can define the Hamming ball

$$\mathbb{B}_H(r; x) = \{y \in \{0, 1\}^n \mid \rho_H(y, x) \leq r\}$$

of radius $r$ centered at some $x \in \{0, 1\}^n$. Of interest here are the Hamming balls centered at the all-zeros vector 0 and all-ones vector 1, respectively. In particular, in this example, we show how a classical combinatorial result due to Harper can be used to bound the concentration function of the metric measure space consisting of the Hamming metric along with the uniform distribution $\mathbb{P}$.

Given two non-empty subsets $A$ and $B$ of the binary hypercube, one consequence of Harper's theorem is that we can always find two positive integers $r_A$ and $r_B$, and associated subsets $A'$ and $B'$, with the following properties:

• the sets $A'$ and $B'$ are sandwiched as

$$\mathbb{B}_H(r_A - 1; 0) \subseteq A' \subseteq \mathbb{B}_H(r_A; 0) \quad \text{and} \quad \mathbb{B}_H(r_B - 1; 1) \subseteq B' \subseteq \mathbb{B}_H(r_B; 1);$$

• the cardinalities are matched as $\mathrm{card}(A) = \mathrm{card}(A')$ and $\mathrm{card}(B) = \mathrm{card}(B')$;
• we have the lower bound $\rho_H(A', B') \geq \rho_H(A, B)$.


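Let us now show that this combinatorial theorem implies that

$$\alpha_{\mathbb{P}}(\epsilon) \leq e^{-2\epsilon^2/n} \quad \text{for all } n \geq 3. \tag{3.42}$$

Consider any subset such that $\mathbb{P}[A] = \frac{|A|}{2^n} \geq \frac{1}{2}$. For any $\epsilon > 0$, define the set $B = \{0, 1\}^n \setminus A^\epsilon$. In order to prove the bound (3.42), it suffices to show that $\mathbb{P}[B] \leq e^{-2\epsilon^2/n}$. Since we always have $\mathbb{P}[B] \leq \frac{1}{2} \leq e^{-2/n}$ for $n \geq 3$, it suffices to restrict our attention to $\epsilon > 1$. By construction, we have

$$\rho_H(A, B) = \min_{a \in A,\, b \in B} \rho_H(a, b) \geq \epsilon.$$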


Let $A'$ and $B'$ denote the subsets guaranteed by Harper's theorem. Since $A$ has cardinality at least $2^{n-1}$, the set $A'$, which has the same cardinality as $A$, must contain all vectors with at most $n/2$ ones. Moreover, by the cardinality matching condition and our choice of the uniform distribution, we have $\mathbb{P}[B] = \mathbb{P}[B']$. On the other hand, the set $B'$ is contained within a Hamming ball centered at the all-ones vector, and we have $\rho_H(A', B') \geq \epsilon > 1$. Consequently, any vector $b \in B'$ must contain at least $\frac{n}{2} + \epsilon$ ones. Thus, if we let $\{X_i\}_{i=1}^n$ be a sequence of i.i.d. Bernoulli variables, we have $\mathbb{P}[B'] \leq \mathbb{P}\big[\sum_{i=1}^n X_i \geq \frac{n}{2} + \epsilon\big] \leq e^{-2\epsilon^2/n}$, where the final inequality follows from the Hoeffding bound.

Since $A$ was an arbitrary set with $\mathbb{P}[A] \geq \frac{1}{2}$, we have shown that the concentration function satisfies the bound (3.42). Applying Proposition 3.11, we conclude that any 1-Lipschitz function on the Boolean hypercube satisfies the concentration bound

$$\mathbb{P}[|f(X) - m_f| \geq \epsilon] \leq 2e^{-2\epsilon^2/n}.$$

Thus, modulo the negligible difference between the mean and median (see Exercise 2.14), we have recovered the bounded differences inequality (2.35) for Lipschitz functions on the Boolean hypercube. ♣
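To make the Hoeffding step concrete, the following sketch evaluates it exactly for one particular set. It is an illustration under stated assumptions, not part of the proof: we take $n$ even and $A$ equal to the set of vectors with at most $n/2$ ones, so that for integer $\epsilon$ the set $B$ of points at Hamming distance at least $\epsilon$ from $A$ consists of the vectors with at least $n/2 + \epsilon$ ones. The function names are our own.

```python
from math import comb, exp


def upper_tail(n: int, k: int) -> float:
    """P[sum of n fair coin flips >= k], computed exactly from binomial counts."""
    k = max(k, 0)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n


def check_centered_ball(n: int) -> list[tuple[int, float, float]]:
    """For even n and A = {x : at most n/2 ones}, list (eps, P[B], bound),
    where B is the set of points at Hamming distance >= eps from A."""
    assert n % 2 == 0, "this illustration assumes n is even"
    rows = []
    for eps in range(1, n // 2 + 1):
        p_b = upper_tail(n, n // 2 + eps)  # B = {x : at least n/2 + eps ones}
        bound = exp(-2 * eps**2 / n)       # right-hand side of (3.42)
        rows.append((eps, p_b, bound))
    return rows


rows = check_centered_ball(20)
for eps, p_b, bound in rows:
    print(f"eps={eps:2d}  P[B]={p_b:.3e}  bound={bound:.3e}")
```

For this choice of $A$ the comparison is exactly the Hoeffding bound for a sum of fair Bernoulli variables; the general statement for arbitrary $A$ relies on Harper's theorem as described above.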


3.2.3 From geometry to concentration 


The geometric perspective suggests the possibility of a variety of connections between convex geometry and the concentration of measure. Consider, for instance, the Brunn–Minkowski inequality: in one of its formulations, it asserts that, for any two convex bodies² $C$ and $D$ in $\mathbb{R}^n$, we have

$$[\mathrm{vol}(\lambda C + (1 - \lambda)D)]^{1/n} \geq \lambda[\mathrm{vol}(C)]^{1/n} + (1 - \lambda)[\mathrm{vol}(D)]^{1/n} \quad \text{for all } \lambda \in [0, 1]. \tag{3.43}$$

Here we use

$$\lambda C + (1 - \lambda)D := \{\lambda c + (1 - \lambda)d \mid c \in C,\ d \in D\}$$

to denote the Minkowski sum of the two sets. The Brunn–Minkowski inequality and its variants are intimately connected to concentration of measure. To appreciate the connection, observe that the concentration function (3.28) defines a notion of extremal sets, namely those that minimize the measure $\mathbb{P}[A^\epsilon]$ subject to a constraint on the size of $\mathbb{P}[A]$. Viewing the volume as a type of unnormalized probability measure, the Brunn–Minkowski inequality (3.43) can be used to prove a classical result of this type:

² A convex body in $\mathbb{R}^n$ is a compact and convex set.


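Example 3.14 (Classical isoperimetric inequality in $\mathbb{R}^n$) Consider the Euclidean ball $\mathbb{B}_2^n := \{x \in \mathbb{R}^n \mid \|x\|_2 \leq 1\}$ in $\mathbb{R}^n$. The classical isoperimetric inequality asserts that, for any set $A \subset \mathbb{R}^n$ such that $\mathrm{vol}(A) = \mathrm{vol}(\mathbb{B}_2^n)$, the volume of its $\epsilon$-enlargement $A^\epsilon$ is lower bounded as

$$\mathrm{vol}(A^\epsilon) \geq \mathrm{vol}([\mathbb{B}_2^n]^\epsilon), \tag{3.44}$$

showing that the ball $\mathbb{B}_2^n$ is extremal. In order to verify this bound, we note that

$$[\mathrm{vol}(A^\epsilon)]^{1/n} = [\mathrm{vol}(A + \epsilon\mathbb{B}_2^n)]^{1/n} \geq [\mathrm{vol}(A)]^{1/n} + [\mathrm{vol}(\epsilon\mathbb{B}_2^n)]^{1/n},$$

where the lower bound follows by applying the Brunn–Minkowski inequality (3.43) with appropriate choices of $(\lambda, C, D)$; see Exercise 3.10 for the details. Since $\mathrm{vol}(A) = \mathrm{vol}(\mathbb{B}_2^n)$ and $[\mathrm{vol}(\epsilon\mathbb{B}_2^n)]^{1/n} = \epsilon\,[\mathrm{vol}(\mathbb{B}_2^n)]^{1/n}$, we see that

$$[\mathrm{vol}(A^\epsilon)]^{1/n} \geq (1 + \epsilon)[\mathrm{vol}(\mathbb{B}_2^n)]^{1/n} = [\mathrm{vol}((\mathbb{B}_2^n)^\epsilon)]^{1/n},$$

which establishes the claim. ♣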


The Brunn–Minkowski inequality has various equivalent formulations. For instance, it can also be stated as

$$\mathrm{vol}(\lambda C + (1 - \lambda)D) \geq [\mathrm{vol}(C)]^{\lambda}[\mathrm{vol}(D)]^{1 - \lambda} \quad \text{for all } \lambda \in [0, 1]. \tag{3.45}$$

This form of the Brunn–Minkowski inequality can be used to establish Lévy-type concentration for the uniform measure on the sphere, albeit with slightly weaker constants than the derivation in Example 3.10. In Exercise 3.10, we explore the equivalence between inequality (3.45) and our original statement (3.43) of the Brunn–Minkowski inequality.
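Both forms of the inequality can be checked by hand in the simplest case of axis-aligned boxes, where the Minkowski combination of two boxes is again a box whose side lengths are the corresponding convex combinations. The sketch below is only a numerical sanity check in this special case; the box dimensions and function names are illustrative assumptions.

```python
from math import prod


def combine(lam, c_sides, d_sides):
    """Side lengths of lam*C + (1-lam)*D when C and D are axis-aligned boxes."""
    return [lam * a + (1 - lam) * b for a, b in zip(c_sides, d_sides)]


def brunn_minkowski_gaps(lam, c_sides, d_sides):
    """Return (left minus right) for the forms (3.43) and (3.45)."""
    n = len(c_sides)
    vol_mix = prod(combine(lam, c_sides, d_sides))
    vol_c, vol_d = prod(c_sides), prod(d_sides)
    gap_root = vol_mix ** (1 / n) - (
        lam * vol_c ** (1 / n) + (1 - lam) * vol_d ** (1 / n)
    )
    gap_mult = vol_mix - vol_c**lam * vol_d ** (1 - lam)
    return gap_root, gap_mult


C = [1.0, 4.0, 2.0]  # hypothetical box with volume 8
D = [3.0, 0.5, 2.0]  # hypothetical box with volume 3
gaps = [brunn_minkowski_gaps(k / 10, C, D) for k in range(11)]
for k, (g_root, g_mult) in enumerate(gaps):
    print(f"lambda={k / 10:.1f}  gap(3.43)={g_root:.4f}  gap(3.45)={g_mult:.4f}")
```

Both gaps vanish at $\lambda \in \{0, 1\}$ and are non-negative in between, as the two formulations require.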


The modified form (3.45) of the Brunn–Minkowski inequality also leads naturally to a functional-analytic generalization, due to Prékopa and Leindler. In turn, this generalized inequality can be used to derive concentration inequalities for strongly log-concave measures.

Theorem 3.15 (Prékopa–Leindler inequality) Let $u, v, w$ be non-negative integrable functions such that, for some $\lambda \in [0, 1]$, we have

$$w(\lambda x + (1 - \lambda)y) \geq [u(x)]^{\lambda}[v(y)]^{1 - \lambda} \quad \text{for all } x, y \in \mathbb{R}^n. \tag{3.46}$$

Then

$$\int w(x)\,dx \geq \Big(\int u(x)\,dx\Big)^{\lambda}\Big(\int v(x)\,dx\Big)^{1 - \lambda}. \tag{3.47}$$
In order to see how this claim implies the classical Brunn–Minkowski inequality (3.45), consider the choices

$$u(x) = \mathbb{I}_C(x), \quad v(x) = \mathbb{I}_D(x) \quad \text{and} \quad w(x) = \mathbb{I}_{\lambda C + (1 - \lambda)D}(x),$$

respectively. Here $\mathbb{I}_C$ denotes the binary-valued indicator function for the event $\{x \in C\}$, with the other indicators defined in an analogous way. In order to show that the classical inequality (3.45) follows as a consequence of Theorem 3.15, we need to verify that

$$\mathbb{I}_{\lambda C + (1 - \lambda)D}(\lambda x + (1 - \lambda)y) \geq [\mathbb{I}_C(x)]^{\lambda}[\mathbb{I}_D(y)]^{1 - \lambda} \quad \text{for all } x, y \in \mathbb{R}^n.$$

For $\lambda = 0$ or $\lambda = 1$, the claim is immediate. For any $\lambda \in (0, 1)$, if either $x \notin C$ or $y \notin D$, the right-hand side is zero, so the statement is trivial. Otherwise, if $x \in C$ and $y \in D$, then both sides are equal to one.


The Prékopa–Leindler inequality can be used to establish some interesting concentration inequalities of Lipschitz functions for a particular subclass of distributions, one which allows for some dependence. In particular, we say that a distribution $\mathbb{P}$ with a density $p$ (with respect to the Lebesgue measure) is a strongly log-concave distribution if the function $\log p$ is strongly concave. Equivalently stated, this condition means that the density can be written in the form $p(x) = \exp(-\psi(x))$, where the function $\psi: \mathbb{R}^n \to \mathbb{R}$ is strongly convex, meaning that there is some $\gamma > 0$ such that

$$\lambda\psi(x) + (1 - \lambda)\psi(y) - \psi(\lambda x + (1 - \lambda)y) \geq \frac{\gamma}{2}\lambda(1 - \lambda)\|x - y\|_2^2 \tag{3.48}$$

for all $\lambda \in [0, 1]$, and $x, y \in \mathbb{R}^n$. For instance, it is easy to verify that the distribution of a standard Gaussian vector in $n$ dimensions is strongly log-concave with parameter $\gamma = 1$. More generally, any Gaussian distribution with covariance matrix $\Sigma \succ 0$ is strongly log-concave with parameter $\gamma = \gamma_{\min}(\Sigma^{-1}) = (\gamma_{\max}(\Sigma))^{-1}$. In addition, there are a variety of non-Gaussian distributions that are also strongly log-concave. For any such distribution, Lipschitz functions are guaranteed to concentrate, as summarized in the following:


Theorem 3.16 Let $\mathbb{P}$ be any strongly log-concave distribution with parameter $\gamma > 0$. Then for any function $f: \mathbb{R}^n \to \mathbb{R}$ that is $L$-Lipschitz with respect to Euclidean norm, we have

$$\mathbb{P}[|f(X) - \mathbb{E}[f(X)]| \geq t] \leq 2e^{-\frac{\gamma t^2}{4L^2}}. \tag{3.49}$$

Remark: Since the standard Gaussian distribution is strongly log-concave with parameter $\gamma = 1$, this theorem implies our earlier result (Theorem 2.26), albeit with a sub-optimal constant in the exponent.
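The claim that the standard Gaussian has parameter $\gamma = 1$ can be checked directly in one dimension, where $\psi(x) = x^2/2$ up to an additive constant and the strong convexity condition (3.48) holds with equality. The following sketch verifies this identity in exact rational arithmetic; the helper names are our own and the one-dimensional restriction is an assumption made for brevity.

```python
from fractions import Fraction


def psi(x):
    """Negative log-density of N(0, 1), up to an additive constant."""
    return x * x / 2


def convexity_gap(lam, x, y):
    """Left-hand side of (3.48) minus its right-hand side, with gamma = 1 and n = 1."""
    lhs = lam * psi(x) + (1 - lam) * psi(y) - psi(lam * x + (1 - lam) * y)
    rhs = Fraction(1, 2) * lam * (1 - lam) * (x - y) ** 2
    return lhs - rhs


points = [Fraction(k, 3) for k in range(-6, 7)]
weights = [Fraction(k, 8) for k in range(9)]
gaps = [convexity_gap(lam, x, y) for lam in weights for x in points for y in points]
print(max(gaps), min(gaps))
```

Since the gap is identically zero, no parameter larger than $\gamma = 1$ is possible for this distribution.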


Proof Let $h$ be an arbitrary zero-mean function with Lipschitz constant $L$ with respect to the Euclidean norm. It suffices to show that $\mathbb{E}[e^{h(X)}] \leq e^{L^2/\gamma}$. Indeed, if this inequality holds, then, given an arbitrary function $f$ with Lipschitz constant $K$ and $\lambda \in \mathbb{R}$, we can apply this inequality to the zero-mean function $h := \lambda(f - \mathbb{E}[f(X)])$, which has Lipschitz constant $L = |\lambda| K$. Doing so yields the bound

$$\mathbb{E}\big[e^{\lambda(f(X) - \mathbb{E}[f(X)])}\big] \leq e^{\frac{\lambda^2 K^2}{\gamma}} \quad \text{for all } \lambda \in \mathbb{R},$$

which shows that $f(X) - \mathbb{E}[f(X)]$ is a sub-Gaussian random variable. As shown in Chapter 2, this type of uniform control on the moment generating function implies the claimed tail bound.

Accordingly, for a given zero-mean function $h$ that is $L$-Lipschitz and for given $\lambda \in (0, 1)$ and $x, y \in \mathbb{R}^n$, define the function

$$g(y) := \inf_{x \in \mathbb{R}^n}\Big\{h(x) + \frac{\gamma}{4}\|x - y\|_2^2\Big\},$$

known as the inf-convolution of $h$ with the rescaled Euclidean norm. With this definition, the proof is based on applying the Prékopa–Leindler inequality with $\lambda = 1/2$ to the triplet of functions $w(z) \equiv p(z) = \exp(-\psi(z))$, the density of $\mathbb{P}$, and the pair of functions

$$u(x) := \exp(-h(x) - \psi(x)) \quad \text{and} \quad v(y) := \exp(g(y) - \psi(y)).$$


We first need to verify that the inequality (3.46) holds with $\lambda = 1/2$. By the definitions of $u$ and $v$, the logarithm of the right-hand side of inequality (3.46), call it $R$ for short, is given by

$$R = \tfrac{1}{2}\{g(y) - h(x)\} - \tfrac{1}{2}\psi(x) - \tfrac{1}{2}\psi(y) = \tfrac{1}{2}\{g(y) - h(x) - 2E(x, y)\} - \psi(x/2 + y/2),$$

where $E(x, y) := \tfrac{1}{2}\psi(x) + \tfrac{1}{2}\psi(y) - \psi(x/2 + y/2)$. Since $\mathbb{P}$ is a $\gamma$-log-concave distribution, the function $\psi$ is $\gamma$-strongly convex, and hence $2E(x, y) \geq \frac{\gamma}{4}\|x - y\|_2^2$. Substituting into the earlier representation of $R$, we find that

$$R \leq \tfrac{1}{2}\Big\{g(y) - h(x) - \frac{\gamma}{4}\|x - y\|_2^2\Big\} - \psi(x/2 + y/2) \leq -\psi(x/2 + y/2),$$

where the final inequality follows from the definition of the inf-convolution $g$. We have thus verified condition (3.46) with $\lambda = 1/2$.

Now since $\int w(x)\,dx = \int p(x)\,dx = 1$ by construction, the Prékopa–Leindler inequality implies that

$$0 \geq \tfrac{1}{2}\log\int e^{-h(x) - \psi(x)}\,dx + \tfrac{1}{2}\log\int e^{g(y) - \psi(y)}\,dy.$$

Rewriting the integrals as expectations and rearranging yields

$$\mathbb{E}[e^{g(Y)}] \leq \frac{1}{\mathbb{E}[e^{-h(X)}]} \overset{(i)}{\leq} \frac{1}{e^{\mathbb{E}[-h(X)]}} \overset{(ii)}{=} 1, \tag{3.50}$$

where step (i) follows from Jensen's inequality, and convexity of the function $t \mapsto \exp(-t)$, and step (ii) uses the fact that $\mathbb{E}[-h(X)] = 0$ by assumption. Finally, since $h$ is an $L$-Lipschitz function, we have $|h(x) - h(y)| \leq L\|x - y\|_2$, and hence

$$g(y) = \inf_{x \in \mathbb{R}^n}\Big\{h(x) + \frac{\gamma}{4}\|x - y\|_2^2\Big\} \geq h(y) + \inf_{x \in \mathbb{R}^n}\Big\{-L\|x - y\|_2 + \frac{\gamma}{4}\|x - y\|_2^2\Big\} = h(y) - \frac{L^2}{\gamma}.$$

Combined with the bound (3.50), we conclude that $\mathbb{E}[e^{h(Y)}] \leq \exp\big(\frac{L^2}{\gamma}\big)$, as claimed.
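For completeness, the value of the inner infimum in the last display can be obtained by completing the square. Writing $r := \|x - y\|_2 \geq 0$ (our notation, introduced only for this calculation), the problem reduces to a one-dimensional quadratic:

```latex
% One-dimensional minimization behind g(y) >= h(y) - L^2 / gamma,
% with r := \|x - y\|_2 >= 0.
\begin{align*}
\inf_{r \geq 0} \Big\{ -L r + \frac{\gamma}{4} r^2 \Big\}
  &= \inf_{r \geq 0} \Big\{ \frac{\gamma}{4} \Big( r - \frac{2L}{\gamma} \Big)^2
     - \frac{L^2}{\gamma} \Big\} \\
  &= -\frac{L^2}{\gamma},
  \qquad \text{attained at } r^\star = \frac{2L}{\gamma}.
\end{align*}
```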


3.3 Wasserstein distances and information inequalities 


We now turn to the topic of Wasserstein distances and information inequalities, also known 
as transportation cost inequalities. On one hand, the transportation cost approach can be 
used to obtain some sharp results for Lipschitz functions of independent random variables. 
Perhaps more importantly, it is especially well suited to certain types of dependent random 
variables, such as those arising in Markov chains and other types of mixing processes. 


3.3.1 Wasserstein distances 


We begin by defining the notion of a Wasserstein distance. Given a metric space $(\mathcal{X}, \rho)$, a function $f: \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz with respect to the metric $\rho$ if

$$|f(x) - f(x')| \leq L\rho(x, x') \quad \text{for all } x, x' \in \mathcal{X}, \tag{3.51}$$

and we use $\|f\|_{\mathrm{Lip}}$ to denote the smallest $L$ for which this inequality holds. Given two probability distributions $\mathbb{Q}$ and $\mathbb{P}$ on $\mathcal{X}$, we can then measure the distance between them via

$$W_\rho(\mathbb{Q}, \mathbb{P}) = \sup_{\|f\|_{\mathrm{Lip}} \leq 1}\Big[\int f\,d\mathbb{Q} - \int f\,d\mathbb{P}\Big], \tag{3.52}$$

where the supremum ranges over all 1-Lipschitz functions. This distance measure is referred to as the Wasserstein metric induced by $\rho$. It can be verified that, for each choice of the metric $\rho$, this definition defines a distance on the space of probability measures.


Example 3.17 (Hamming metric and total variation distance) Consider the Hamming metric $\rho(x, x') = \mathbb{I}[x \neq x']$. We claim that, in this case, the associated Wasserstein distance is equivalent to the total variation distance

$$\|\mathbb{Q} - \mathbb{P}\|_{\mathrm{TV}} := \sup_{A \subseteq \mathcal{X}} |\mathbb{Q}(A) - \mathbb{P}(A)|, \tag{3.53}$$

where the supremum ranges over all measurable subsets $A$. To see this equivalence, note that any function that is 1-Lipschitz with respect to the Hamming distance satisfies the bound $|f(x) - f(x')| \leq 1$. Since the supremum (3.52) is invariant to constant offsets of the function, we may restrict the supremum to functions such that $f(x) \in [0, 1]$ for all $x \in \mathcal{X}$, thereby obtaining

$$W_{\mathrm{Ham}}(\mathbb{Q}, \mathbb{P}) = \sup_{f: \mathcal{X} \to [0, 1]} \int f\,(d\mathbb{Q} - d\mathbb{P}) \overset{(i)}{=} \|\mathbb{Q} - \mathbb{P}\|_{\mathrm{TV}},$$

where equality (i) follows from Exercise 3.13.

In terms of the underlying densities³ $p$ and $q$ taken with respect to a base measure $\nu$, we can write

$$W_{\mathrm{Ham}}(\mathbb{Q}, \mathbb{P}) = \|\mathbb{Q} - \mathbb{P}\|_{\mathrm{TV}} = \frac{1}{2}\int |p(x) - q(x)|\,\nu(dx),$$

corresponding to (one half) the $L^1(\nu)$-norm between the densities. Again, see Exercise 3.13 for further details on this equivalence. ♣
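On a small finite space these characterizations of total variation can be compared by brute force. The sketch below computes the supremum over subsets in (3.53), the half-$L^1$ formula, and the mismatch probability $1 - \sum_x \min\{p(x), q(x)\}$, which is the standard value of the smallest achievable $\mathbb{M}[X \neq X']$ over couplings. The two distributions and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def tv_sup(p, q):
    """sup over all subsets A of |Q(A) - P(A)|, by exhaustive enumeration."""
    best = Fraction(0)
    for r in range(len(p) + 1):
        for subset in combinations(range(len(p)), r):
            gap = abs(sum(q[i] for i in subset) - sum(p[i] for i in subset))
            best = max(best, gap)
    return best


def tv_half_l1(p, q):
    """One half of the L1 distance between the probability mass functions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def tv_coupling(p, q):
    """Mismatch probability when min(p, q) mass is kept on the diagonal."""
    return 1 - sum(min(a, b) for a, b in zip(p, q))


p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
q = [Fraction(1, 4)] * 4
print(tv_sup(p, q), tv_half_l1(p, q), tv_coupling(p, q))
```

All three quantities agree on this example, as the equivalence in the text predicts.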


By a classical and deep result in duality theory (see the bibliographic section for details), any Wasserstein distance has an equivalent definition as a type of coupling-based distance. A distribution $\mathbb{M}$ on the product space $\mathcal{X} \otimes \mathcal{X}$ is a coupling of the pair $(\mathbb{Q}, \mathbb{P})$ if its marginal distributions in the first and second coordinates coincide with $\mathbb{Q}$ and $\mathbb{P}$, respectively. In order to see the relation to the Wasserstein distance, let $f: \mathcal{X} \to \mathbb{R}$ be any 1-Lipschitz function, and let $\mathbb{M}$ be any coupling. We then have

$$\int \rho(x, x')\,d\mathbb{M}(x, x') \overset{(i)}{\geq} \int (f(x') - f(x))\,d\mathbb{M}(x, x') \overset{(ii)}{=} \int f\,(d\mathbb{P} - d\mathbb{Q}), \tag{3.54}$$

where the inequality (i) follows from the 1-Lipschitz nature of $f$, and the equality (ii) follows since $\mathbb{M}$ is a coupling. The Kantorovich–Rubinstein duality guarantees the following important fact: if we minimize over all possible couplings, then this argument can be reversed, and in fact we have the equivalence

$$\underbrace{\sup_{\|f\|_{\mathrm{Lip}} \leq 1} \int f\,(d\mathbb{Q} - d\mathbb{P})}_{W_\rho(\mathbb{P}, \mathbb{Q})} = \inf_{\mathbb{M}} \int_{\mathcal{X} \times \mathcal{X}} \rho(x, x')\,d\mathbb{M}(x, x') = \inf_{\mathbb{M}} \mathbb{E}_{\mathbb{M}}[\rho(X, X')], \tag{3.55}$$

where the infimum ranges over all couplings $\mathbb{M}$ of the pair $(\mathbb{P}, \mathbb{Q})$. This coupling-based representation of the Wasserstein distance plays an important role in many of the proofs to follow.
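The two sides of the duality can be exhibited explicitly when $\mathcal{X} = \{0, 1, \ldots, m-1\}$ with the metric $\rho(i, j) = |i - j|$. The sketch below is an illustration under these assumptions, not the general construction: it builds a coupling by greedily matching mass from left to right, and a 1-Lipschitz function whose slopes are $\pm 1$ according to the sign of the difference of cumulative distribution functions. When the coupling cost and the dual value coincide, weak duality certifies that both are optimal.

```python
from fractions import Fraction


def monotone_coupling_cost(p, q):
    """Cost sum |i - j| * M[i, j] of the greedy left-to-right coupling of p and q."""
    p, q = list(p), list(q)
    i = j = 0
    cost = Fraction(0)
    while i < len(p) and j < len(q):
        moved = min(p[i], q[j])      # mass sent from location i to location j
        cost += moved * abs(i - j)
        p[i] -= moved
        q[j] -= moved
        if p[i] == 0:
            i += 1
        if q[j] == 0:
            j += 1
    return cost


def dual_value(p, q):
    """sum_k f(k) (q_k - p_k) for a 1-Lipschitz f built from the CDF gap."""
    f = [Fraction(0)]
    cdf_p = cdf_q = Fraction(0)
    for k in range(len(p) - 1):
        cdf_p += p[k]
        cdf_q += q[k]
        step = 1 if cdf_p > cdf_q else -1  # slopes of +-1 keep f 1-Lipschitz
        f.append(f[-1] + step)
    return sum(fk * (qk - pk) for fk, pk, qk in zip(f, p, q))


p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
q = [Fraction(1, 4)] * 4
print(monotone_coupling_cost(p, q), dual_value(p, q))
```

Here every 1-Lipschitz function gives a lower bound and every coupling gives an upper bound, so equality of the two printed values pins down the Wasserstein distance for this pair.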

The term “transportation cost” arises from the following interpretation of the coupling-based representation (3.55). For concreteness, let us consider the case where $\mathbb{P}$ and $\mathbb{Q}$ have densities $p$ and $q$ with respect to Lebesgue measure on $\mathcal{X}$, and the coupling $\mathbb{M}$ has density $m$ with respect to Lebesgue measure on the product space. The density $p$ can be viewed as describing some initial distribution of mass over the space $\mathcal{X}$, whereas the density $q$ can be interpreted as some desired distribution of the mass. Our goal is to shift mass so as to transform the initial distribution $p$ to the desired distribution $q$. The quantity $\rho(x, x')\,dx\,dx'$ can be interpreted as the cost of transporting a small increment of mass $dx$ to the new increment $dx'$. The joint distribution $m(x, x')$ is known as a transportation plan, meaning a scheme for shifting mass so that $p$ is transformed to $q$. Combining these ingredients, we conclude that the transportation cost associated with the plan $m$ is given by

$$\int_{\mathcal{X} \times \mathcal{X}} \rho(x, x')\,m(x, x')\,dx\,dx',$$

and minimizing over all admissible plans, that is, those that marginalize down to $p$ and $q$, respectively, yields the Wasserstein distance.

³ This assumption entails no loss of generality, since $\mathbb{P}$ and $\mathbb{Q}$ both have densities with respect to $\nu = \frac{1}{2}(\mathbb{P} + \mathbb{Q})$.


3.3.2 Transportation cost and concentration inequalities 


Let us now turn to the notion of a transportation cost inequality, and its implications for the concentration of measure. Transportation cost inequalities are based on upper bounding the Wasserstein distance $W_\rho(\mathbb{Q}, \mathbb{P})$ in terms of the Kullback–Leibler (KL) divergence. Given two distributions $\mathbb{Q}$ and $\mathbb{P}$, the KL divergence between them is given by

$$D(\mathbb{Q} \,\|\, \mathbb{P}) := \begin{cases} \mathbb{E}_{\mathbb{Q}}\big[\log\frac{d\mathbb{Q}}{d\mathbb{P}}\big] & \text{when } \mathbb{Q} \text{ is absolutely continuous with respect to } \mathbb{P}, \\ +\infty & \text{otherwise.} \end{cases} \tag{3.56}$$

If the measures have densities⁴ with respect to some underlying measure $\nu$, say $q$ and $p$, then the Kullback–Leibler divergence can be written in the form

$$D(\mathbb{Q} \,\|\, \mathbb{P}) = \int_{\mathcal{X}} q(x)\log\frac{q(x)}{p(x)}\,\nu(dx). \tag{3.57}$$

Although the KL divergence provides a measure of distance between distributions, it is not actually a metric (since, for instance, it is not symmetric in general).


We say that a transportation cost inequality is satisfied when the Wasserstein distance is 
upper bounded by a multiple of the square-root KL divergence. 


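Definition 3.18 For a given metric $\rho$, the probability measure $\mathbb{P}$ is said to satisfy a $\rho$-transportation cost inequality with parameter $\gamma > 0$ if

$$W_\rho(\mathbb{Q}, \mathbb{P}) \leq \sqrt{2\gamma D(\mathbb{Q} \,\|\, \mathbb{P})} \tag{3.58}$$

for all probability measures $\mathbb{Q}$.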


Such results are also known as information inequalities, due to the role of the Kullback–Leibler divergence in information theory. A classical example of an information inequality is the Pinsker–Csiszár–Kullback inequality, which relates the total variation distance with the KL divergence. More precisely, for all probability distributions $\mathbb{P}$ and $\mathbb{Q}$, we have

$$\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}} \leq \sqrt{\tfrac{1}{2}D(\mathbb{Q} \,\|\, \mathbb{P})}. \tag{3.59}$$

From our development in Example 3.17, this inequality corresponds to a transportation cost inequality, in which $\gamma = 1/4$ and the Wasserstein distance is based on the Hamming norm $\rho(x, x') = \mathbb{I}[x \neq x']$. As will be seen shortly, this inequality can be used to recover the bounded differences inequality, corresponding to a concentration statement for functions that are Lipschitz with respect to the Hamming norm. See Exercise 15.6 in Chapter 15 for the proof of this bound.

⁴ In the special case of a discrete space $\mathcal{X}$, and probability mass functions $q$ and $p$, we have $D(\mathbb{Q} \,\|\, \mathbb{P}) = \sum_{x \in \mathcal{X}} q(x)\log\frac{q(x)}{p(x)}$.
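As a quick numerical illustration of inequality (3.59), the sketch below evaluates both sides for pairs of Bernoulli distributions, where the total variation distance is simply $|q - p|$ and the KL divergence has a closed form. The grid of parameters and the function names are assumptions made for the example; the check is a sanity test, not a proof.

```python
from math import log, sqrt


def kl_bernoulli(q: float, p: float) -> float:
    """D(Q || P) in nats for Q = Bernoulli(q), P = Bernoulli(p), with 0 < p < 1."""
    total = 0.0
    for a, b in ((q, p), (1 - q, 1 - p)):
        if a > 0:  # convention 0 * log 0 = 0
            total += a * log(a / b)
    return total


def pinsker_slack(q: float, p: float) -> float:
    """sqrt(D(Q || P) / 2) minus the total variation distance |q - p|."""
    return sqrt(0.5 * kl_bernoulli(q, p)) - abs(q - p)


grid = [k / 20 for k in range(1, 20)]
slacks = [pinsker_slack(q, p) for q in [0.0] + grid + [1.0] for p in grid]
print(min(slacks))
```

The slack is smallest when $q$ and $p$ are both close to $1/2$, which reflects the fact that the constant $1/2$ in (3.59) cannot be improved.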


By the definition (3.52) of the Wasserstein distance, the transportation cost inequality (3.58) can be used to upper bound the deviation $\int f\,d\mathbb{Q} - \int f\,d\mathbb{P}$ in terms of the Kullback–Leibler divergence $D(\mathbb{Q} \,\|\, \mathbb{P})$. As shown by the following result, a particular choice of distribution $\mathbb{Q}$ can be used to derive a concentration bound for $f$ under $\mathbb{P}$. In this way, a transportation cost inequality leads to concentration bounds for Lipschitz functions:


Theorem 3.19 (From transportation cost to concentration) Consider a metric measure space $(\mathbb{P}, \mathcal{X}, \rho)$, and suppose that $\mathbb{P}$ satisfies the $\rho$-transportation cost inequality (3.58). Then its concentration function satisfies the bound

$$\alpha_{\mathbb{P},(\mathcal{X},\rho)}(t) \leq 2\exp\Big(-\frac{t^2}{2\gamma}\Big). \tag{3.60}$$

Moreover, for any $X \sim \mathbb{P}$ and any $L$-Lipschitz function $f: \mathcal{X} \to \mathbb{R}$, we have the concentration inequality

$$\mathbb{P}[|f(X) - \mathbb{E}[f(X)]| \geq t] \leq 2\exp\Big(-\frac{t^2}{2\gamma L^2}\Big). \tag{3.61}$$

Remarks: By Proposition 3.11, the bound (3.60) implies that

$$\mathbb{P}[|f(X) - m_f| \geq t] \leq 2\exp\Big(-\frac{t^2}{2\gamma L^2}\Big), \tag{3.62}$$

where $m_f$ is any median of $f$. In turn, this bound can be used to establish concentration around the mean, albeit with worse constants than the bound (3.61). (See Exercise 2.14 for details on this equivalence.) In our proof, we make use of separate arguments for the median and mean, so as to obtain sharp constants.


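Proof We begin by proving the bound (3.60). For any set $A$ with $\mathbb{P}[A] \geq 1/2$ and a given $\epsilon > 0$, consider the set

$$B := (A^\epsilon)^c = \{y \in \mathcal{X} \mid \rho(x, y) \geq \epsilon \ \ \forall x \in A\}.$$

If $\mathbb{P}(A^\epsilon) = 1$, then the proof is complete, so that we may assume that $\mathbb{P}(B) > 0$.

By construction, we have $\rho(A, B) := \inf_{x \in A}\inf_{y \in B}\rho(x, y) \geq \epsilon$. On the other hand, let $\mathbb{P}_A$ and $\mathbb{P}_B$ denote the distributions of $\mathbb{P}$ conditioned on $A$ and $B$, and let $\mathbb{M}$ denote any coupling of this pair. Since the marginals of $\mathbb{M}$ are supported on $A$ and $B$, respectively, we have $\rho(A, B) \leq \int \rho(x, x')\,d\mathbb{M}(x, x')$. Taking the infimum over all couplings, we conclude that $\epsilon \leq \rho(A, B) \leq W_\rho(\mathbb{P}_A, \mathbb{P}_B)$.

Now applying the triangle inequality, we have

$$\epsilon \leq W_\rho(\mathbb{P}_A, \mathbb{P}_B) \leq W_\rho(\mathbb{P}, \mathbb{P}_A) + W_\rho(\mathbb{P}, \mathbb{P}_B) \overset{(ii)}{\leq} \sqrt{\gamma D(\mathbb{P}_A \,\|\, \mathbb{P})} + \sqrt{\gamma D(\mathbb{P}_B \,\|\, \mathbb{P})} \overset{(iii)}{\leq} \sqrt{2\gamma}\,\{D(\mathbb{P}_A \,\|\, \mathbb{P}) + D(\mathbb{P}_B \,\|\, \mathbb{P})\}^{1/2},$$

where step (ii) follows from the transportation cost inequality, and step (iii) follows from the inequality $(a + b)^2 \leq 2a^2 + 2b^2$.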

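It remains to compute the Kullback–Leibler divergences. For any measurable set $C$, we have $\mathbb{P}_A(C) = \mathbb{P}(C \cap A)/\mathbb{P}(A)$, so that $D(\mathbb{P}_A \,\|\, \mathbb{P}) = \log\frac{1}{\mathbb{P}(A)}$. Similarly, we have $D(\mathbb{P}_B \,\|\, \mathbb{P}) = \log\frac{1}{\mathbb{P}(B)}$. Combining the pieces, we conclude that

$$\epsilon^2 \leq 2\gamma\{\log(1/\mathbb{P}(A)) + \log(1/\mathbb{P}(B))\} = 2\gamma\log\Big(\frac{1}{\mathbb{P}(A)\mathbb{P}(B)}\Big),$$

or equivalently $\mathbb{P}(A)\mathbb{P}(B) \leq \exp\big(-\frac{\epsilon^2}{2\gamma}\big)$. Since $\mathbb{P}(A) \geq 1/2$ and $B = (A^\epsilon)^c$, we conclude that $\mathbb{P}(A^\epsilon) \geq 1 - 2\exp\big(-\frac{\epsilon^2}{2\gamma}\big)$. Since $A$ was an arbitrary set with $\mathbb{P}(A) \geq 1/2$, the bound (3.60) follows.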
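The identity $D(\mathbb{P}_A \,\|\, \mathbb{P}) = \log(1/\mathbb{P}(A))$ used above is easy to confirm on a finite space. In the sketch below, the distribution `p` and the event `A` are arbitrary illustrative choices, and the helper names are our own.

```python
from math import isclose, log


def kl(q, p):
    """Discrete KL divergence sum_x q(x) log(q(x) / p(x)), with 0 log 0 = 0."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def condition(p, event):
    """Probability mass function of P conditioned on the given set of indices."""
    mass = sum(p[i] for i in event)
    return [p[i] / mass if i in event else 0.0 for i in range(len(p))]


p = [0.1, 0.2, 0.3, 0.4]
A = {2, 3}                      # event with P(A) = 0.7
p_A = condition(p, A)
print(kl(p_A, p), log(1 / 0.7))
```

The reason is visible in the code: on $A$ the likelihood ratio $p_A(x)/p(x)$ is the constant $1/\mathbb{P}(A)$, so its logarithm averages to $\log(1/\mathbb{P}(A))$.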


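We now turn to the proof of the concentration statement (3.61) for the mean. If one is not concerned about constants, such a bound follows immediately by combining claim (3.60) with the result of Exercise 2.14. Here we present an alternative proof with the dual goals of obtaining the sharp result and illustrating a different proof technique. Throughout this proof, we use $\mathbb{E}_{\mathbb{Q}}[f]$ and $\mathbb{E}_{\mathbb{P}}[f]$ to denote the mean of the random variable $f(X)$ when $X \sim \mathbb{Q}$ and $X \sim \mathbb{P}$, respectively. We begin by observing that

$$\int f\,(d\mathbb{Q} - d\mathbb{P}) \overset{(i)}{\leq} L\,W_\rho(\mathbb{Q}, \mathbb{P}) \overset{(ii)}{\leq} \sqrt{2L^2\gamma D(\mathbb{Q} \,\|\, \mathbb{P})},$$

where step (i) follows from the $L$-Lipschitz condition on $f$ and the definition (3.52); and step (ii) follows from the information inequality (3.58). For any positive numbers $(u, v, \lambda)$, we have $\sqrt{2uv} \leq \frac{u}{2}\lambda + \frac{v}{\lambda}$. Applying this inequality with $u = L^2\gamma$ and $v = D(\mathbb{Q} \,\|\, \mathbb{P})$ yields

$$\int f\,(d\mathbb{Q} - d\mathbb{P}) \leq \frac{\lambda\gamma L^2}{2} + \frac{1}{\lambda}D(\mathbb{Q} \,\|\, \mathbb{P}), \tag{3.63}$$

valid for all $\lambda > 0$.

Now define a distribution $\mathbb{Q}$ with Radon–Nikodym derivative $\frac{d\mathbb{Q}}{d\mathbb{P}}(x) = e^{g(x)}/\mathbb{E}_{\mathbb{P}}[e^{g(X)}]$, where $g(x) := \lambda(f(x) - \mathbb{E}_{\mathbb{P}}(f)) - \frac{L^2\gamma\lambda^2}{2}$. (Note that our proof of the bound (3.61) ensures that $\mathbb{E}_{\mathbb{P}}[e^{g(X)}]$ exists.) With this choice, we have

$$D(\mathbb{Q} \,\|\, \mathbb{P}) = \mathbb{E}_{\mathbb{Q}}\log\Big(\frac{e^{g(X)}}{\mathbb{E}_{\mathbb{P}}[e^{g(X)}]}\Big) = \lambda\{\mathbb{E}_{\mathbb{Q}}(f(X)) - \mathbb{E}_{\mathbb{P}}(f(X))\} - \frac{\gamma L^2\lambda^2}{2} - \log\mathbb{E}_{\mathbb{P}}[e^{g(X)}].$$

Combining with inequality (3.63) and performing some algebra (during which the reader should recall that $\lambda > 0$), we find that $\log\mathbb{E}_{\mathbb{P}}[e^{g(X)}] \leq 0$, or equivalently

$$\mathbb{E}_{\mathbb{P}}\big[e^{\lambda(f(X) - \mathbb{E}_{\mathbb{P}}[f(X)])}\big] \leq e^{\frac{\lambda^2\gamma L^2}{2}}.$$

The upper tail bound thus follows by the Chernoff bound. The same argument can be applied to $-f$, which yields the lower tail bound.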


3.3.3 Tensorization for transportation cost 


Based on Theorem 3.19, we see that transportation cost inequalities can be translated into concentration inequalities. Like entropy, transportation cost inequalities behave nicely for product measures, and can be combined in an additive manner. Doing so yields concentration inequalities for Lipschitz functions in the higher-dimensional space. We summarize in the following:

Proposition 3.20 Suppose that, for each $k = 1, 2, \ldots, n$, the univariate distribution $\mathbb{P}_k$ satisfies a $\rho_k$-transportation cost inequality with parameter $\gamma_k$. Then the product distribution $\mathbb{P} = \bigotimes_{k=1}^n \mathbb{P}_k$ satisfies the transportation cost inequality

$$W_\rho(\mathbb{Q}, \mathbb{P}) \leq \sqrt{2\Big(\sum_{k=1}^n \gamma_k\Big)D(\mathbb{Q} \,\|\, \mathbb{P})} \quad \text{for all distributions } \mathbb{Q}, \tag{3.64}$$

where the Wasserstein metric is defined using the distance $\rho(x, y) := \sum_{k=1}^n \rho_k(x_k, y_k)$.


Before turning to the proof of Proposition 3.20, it is instructive to see how, in conjunction 
with Theorem 3.19, it can be used to recover the bounded differences inequality. 


Example 3.21 (Bounded differences inequality) Suppose that $f$ satisfies the bounded differences property with parameter $L_k$ in coordinate $k$. Then using the triangle inequality and the bounded differences property, it can be verified that $f$ is a 1-Lipschitz function with respect to the rescaled Hamming metric

$$\rho(x, y) := \sum_{k=1}^n \rho_k(x_k, y_k), \quad \text{where } \rho_k(x_k, y_k) := L_k\,\mathbb{I}[x_k \neq y_k].$$

By the Pinsker–Csiszár–Kullback inequality (3.59), each univariate distribution $\mathbb{P}_k$ satisfies a $\rho_k$-transportation cost inequality with parameter $\gamma_k = \frac{L_k^2}{4}$, so that Proposition 3.20 implies that $\mathbb{P} = \bigotimes_{k=1}^n \mathbb{P}_k$ satisfies a $\rho$-transportation cost inequality with parameter $\gamma := \frac{1}{4}\sum_{k=1}^n L_k^2$. Since $f$ is 1-Lipschitz with respect to the metric $\rho$, Theorem 3.19 implies that

$$\mathbb{P}[|f(X) - \mathbb{E}[f(X)]| \geq t] \leq 2\exp\Big(-\frac{2t^2}{\sum_{k=1}^n L_k^2}\Big). \tag{3.65}$$

In this way, we recover the bounded differences inequality from Chapter 2 from a transportation cost argument. ♣
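The bookkeeping of constants in this example can be traced in a few lines. The sketch below computes the parameter $\gamma = \frac{1}{4}\sum_k L_k^2$, plugs it into the right-hand side of (3.61) with Lipschitz constant one, and compares the result with the right-hand side of (3.65). The function names and the sample-mean example (with $L_k = 1/n$) are illustrative assumptions.

```python
from math import exp, isclose


def transport_parameter(lipschitz_constants):
    """gamma = (1/4) * sum_k L_k^2, from gamma_k = L_k^2 / 4 and Proposition 3.20."""
    return sum(L * L for L in lipschitz_constants) / 4


def tail_from_transport(t, gamma, lipschitz=1.0):
    """Right-hand side of (3.61): 2 exp(-t^2 / (2 gamma L^2))."""
    return 2 * exp(-t * t / (2 * gamma * lipschitz**2))


def bounded_differences_tail(t, lipschitz_constants):
    """Right-hand side of (3.65): 2 exp(-2 t^2 / sum_k L_k^2)."""
    return 2 * exp(-2 * t * t / sum(L * L for L in lipschitz_constants))


n = 100
constants = [1 / n] * n  # e.g. a sample mean of n variables taking values in [0, 1]
gamma = transport_parameter(constants)
print(tail_from_transport(0.1, gamma), bounded_differences_tail(0.1, constants))
```

The two expressions agree because $t^2/(2\gamma) = 2t^2/\sum_k L_k^2$ for this choice of $\gamma$.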


Our proof of Proposition 3.20 is based on the coupling-based characterization (3.55) of 
Wasserstein distances. 


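Proof Letting $\mathbb{Q}$ be an arbitrary distribution over the product space $\mathcal{X}^n$, we construct a coupling $\mathbb{M}$ of the pair $(\mathbb{P}, \mathbb{Q})$. For each $j = 2, \ldots, n$, let $\mathbb{M}_1^j$ denote the joint distribution over the pair $(X_1^j, Y_1^j) = (X_1, \ldots, X_j, Y_1, \ldots, Y_j)$, and let $\mathbb{M}_{j|j-1}$ denote the conditional distribution of $(X_j, Y_j)$ given $(X_1^{j-1}, Y_1^{j-1})$. By the dual representation (3.55), we have

$$W_\rho(\mathbb{Q}, \mathbb{P}) \leq \mathbb{E}_{\mathbb{M}_1}[\rho_1(X_1, Y_1)] + \sum_{j=2}^n \mathbb{E}_{\mathbb{M}_1^{j-1}}\big[\mathbb{E}_{\mathbb{M}_{j|j-1}}[\rho_j(X_j, Y_j)]\big],$$

where $\mathbb{M}_1$ denotes the marginal distribution over the pair $(X_1, Y_1)$. We now define our coupling $\mathbb{M}$ in an inductive manner as follows. First, choose $\mathbb{M}_1$ to be an optimal coupling of the pair $(\mathbb{P}_1, \mathbb{Q}_1)$, thereby ensuring that

$$\mathbb{E}_{\mathbb{M}_1}[\rho_1(X_1, Y_1)] \overset{(i)}{=} W_{\rho_1}(\mathbb{Q}_1, \mathbb{P}_1) \overset{(ii)}{\leq} \sqrt{2\gamma_1 D(\mathbb{Q}_1 \,\|\, \mathbb{P}_1)},$$

where equality (i) follows by the optimality of the coupling, and inequality (ii) follows from the assumed transportation cost inequality for $\mathbb{P}_1$. Now assume that the joint distribution over $(X_1^{j-1}, Y_1^{j-1})$ has been defined. We choose conditional distribution $\mathbb{M}_{j|j-1}(\cdot \mid x_1^{j-1}, y_1^{j-1})$ to be an optimal coupling for the pair $(\mathbb{P}_j, \mathbb{Q}_{j|j-1}(\cdot \mid y_1^{j-1}))$, thereby ensuring that

$$\mathbb{E}_{\mathbb{M}_{j|j-1}}[\rho_j(X_j, Y_j)] \leq \sqrt{2\gamma_j D(\mathbb{Q}_{j|j-1}(\cdot \mid y_1^{j-1}) \,\|\, \mathbb{P}_j)},$$

valid for each $y_1^{j-1}$. Taking averages over $Y_1^{j-1}$ with respect to the marginal distribution $\mathbb{M}_1^{j-1}$ (or, equivalently, the marginal $\mathbb{Q}_1^{j-1}$), the concavity of the square-root function and Jensen's inequality implies that

$$\mathbb{E}_{\mathbb{M}_1^{j-1}}\big[\mathbb{E}_{\mathbb{M}_{j|j-1}}[\rho_j(X_j, Y_j)]\big] \leq \sqrt{2\gamma_j\,\mathbb{E}_{\mathbb{Q}_1^{j-1}} D(\mathbb{Q}_{j|j-1}(\cdot \mid Y_1^{j-1}) \,\|\, \mathbb{P}_j)}.$$

Combining the ingredients, we obtain

$$\begin{aligned} W_\rho(\mathbb{Q}, \mathbb{P}) &\leq \sqrt{2\gamma_1 D(\mathbb{Q}_1 \,\|\, \mathbb{P}_1)} + \sum_{j=2}^n \sqrt{2\gamma_j\,\mathbb{E}_{\mathbb{Q}_1^{j-1}}\big[D(\mathbb{Q}_{j|j-1}(\cdot \mid Y_1^{j-1}) \,\|\, \mathbb{P}_j)\big]} \\ &\overset{(i)}{\leq} \sqrt{2\Big(\sum_{j=1}^n \gamma_j\Big)}\,\sqrt{D(\mathbb{Q}_1 \,\|\, \mathbb{P}_1) + \sum_{j=2}^n \mathbb{E}_{\mathbb{Q}_1^{j-1}}\big[D(\mathbb{Q}_{j|j-1}(\cdot \mid Y_1^{j-1}) \,\|\, \mathbb{P}_j)\big]} \\ &\overset{(ii)}{=} \sqrt{2\Big(\sum_{j=1}^n \gamma_j\Big)D(\mathbb{Q} \,\|\, \mathbb{P})}, \end{aligned}$$

where step (i) follows by the Cauchy–Schwarz inequality, and equality (ii) uses the chain rule for Kullback–Leibler divergence from Exercise 3.2.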


In Exercise 3.14, we sketch out an alternative proof of Proposition 3.20, one which makes 
direct use of the Lipschitz characterization of the Wasserstein distance. 


3.3.4 Transportation cost inequalities for Markov chains 


As mentioned previously, the transportation cost approach has some desirable features in ap- 
plication to Lipschitz functions involving certain types of dependent random variables. Here 
we illustrate this type of argument for the case of a Markov chain. (See the bibliographic 
section for references to more general results on concentration for dependent random vari- 
ables.) 

More concretely, let $(X_1, \ldots, X_n)$ be a random vector generated by a Markov chain, where each $X_i$ takes values in a countable space $\mathcal{X}$. Its distribution $\mathbb{P}$ over $\mathcal{X}^n$ is defined by an initial distribution $X_1 \sim \mathbb{P}_1$, and the transition kernels

$$\mathbb{K}_{i+1}(x_{i+1} \mid x_i) = \mathbb{P}_{i+1}(X_{i+1} = x_{i+1} \mid X_i = x_i). \tag{3.66}$$

Here we focus on discrete state Markov chains that are $\beta$-contractive, meaning that there exists some $\beta \in [0, 1)$ such that

$$\max_{i=1,\ldots,n-1}\ \sup_{x_i, x_i'} \|\mathbb{K}_{i+1}(\cdot \mid x_i) - \mathbb{K}_{i+1}(\cdot \mid x_i')\|_{\mathrm{TV}} \leq \beta, \tag{3.67}$$

where the total variation norm (3.53) was previously defined.


Theorem 3.22 Let $\mathbb{P}$ be the distribution of a $\beta$-contractive Markov chain (3.67) over the discrete space $\mathcal{X}^n$. Then for any other distribution $\mathbb{Q}$ over $\mathcal{X}^n$, we have

$$W_\rho(\mathbb{Q}, \mathbb{P}) \leq \frac{1}{1 - \beta}\sqrt{\frac{n}{2}D(\mathbb{Q} \,\|\, \mathbb{P})}, \tag{3.68}$$

where the Wasserstein distance is defined with respect to the Hamming norm $\rho(x, y) = \sum_{i=1}^n \mathbb{I}[x_i \neq y_i]$.

Remark: See the bibliography section for references to proofs of this result. Using Theorem 3.19, an immediate corollary of the bound (3.68) is that for any function $f: \mathcal{X}^n \to \mathbb{R}$ that is $L$-Lipschitz with respect to the Hamming norm, we have

$$\mathbb{P}[|f(X) - \mathbb{E}[f(X)]| \geq t] \leq 2\exp\Big(-\frac{2(1 - \beta)^2 t^2}{nL^2}\Big). \tag{3.69}$$

Note that this result is a strict generalization of the bounded difference inequality for independent random variables, to which it reduces when $\beta = 0$.


Example 3.23 (Parameter estimation for a binary Markov chain) Consider a Markov chain over binary variables $X_i \in \{0, 1\}$ specified by an initial distribution $\mathbb{P}_1$ that is uniform, and the transition kernel

$$\mathbb{K}_{i+1}(x_{i+1} \mid x_i) = \begin{cases} \frac{1}{2}(1 + \delta) & \text{if } x_{i+1} = x_i, \\ \frac{1}{2}(1 - \delta) & \text{if } x_{i+1} \neq x_i, \end{cases}$$

where $\delta \in [0, 1]$ is a “stickiness” parameter. Suppose that our goal is to estimate the parameter $\delta$ based on an $n$-length vector $(X_1, \ldots, X_n)$ drawn according to this chain. An unbiased estimate of $\frac{1}{2}(1 + \delta)$ is given by the function

$$f(X_1, \ldots, X_n) := \frac{1}{n - 1}\sum_{i=1}^{n-1} \mathbb{I}[X_i = X_{i+1}],$$

corresponding to the fraction of times that successive samples take the same value. We claim that $f$ satisfies the concentration inequality

$$\mathbb{P}\big[|f(X) - \tfrac{1}{2}(1 + \delta)| \geq t\big] \leq 2e^{-\frac{(n-1)^2(1-\delta)^2 t^2}{2n}} \leq 2e^{-\frac{(n-1)(1-\delta)^2 t^2}{4}}. \tag{3.70}$$

Following some calculation, we find that the chain is $\beta$-contractive with $\beta = \delta$. Moreover, the function $f$ is $\frac{2}{n-1}$-Lipschitz with respect to the Hamming norm. Consequently, the bound (3.70) follows as a consequence of our earlier general result (3.69). ♣
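The two calculations invoked at the end of this example can be reproduced by brute force for small instances. The sketch below computes the total variation distance between the two rows of the transition kernel (which gives the contraction coefficient), finds the Hamming Lipschitz constant of $f$ by flipping one coordinate at a time, and evaluates the first expression in (3.70). The specific values of $\delta$, $n$ and $t$, as well as the function names, are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import exp


def kernel_row(x_prev, delta):
    """Transition probabilities (to state 0, to state 1) from state x_prev."""
    stay = (1 + delta) / 2
    return [stay if x == x_prev else 1 - stay for x in (0, 1)]


def tv(row_a, row_b):
    """Total variation distance between two probability mass functions."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b)) / 2


def f(x):
    """Fraction of successive pairs that take the same value."""
    n = len(x)
    return Fraction(sum(x[i] == x[i + 1] for i in range(n - 1)), n - 1)


def hamming_lipschitz_constant(n):
    """Largest change in f from flipping a single coordinate (enough by the triangle inequality)."""
    worst = Fraction(0)
    for x in product((0, 1), repeat=n):
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            worst = max(worst, abs(f(x) - f(y)))
    return worst


def markov_tail_bound(t, n, delta):
    """First expression in (3.70): (3.69) with beta = delta and L = 2 / (n - 1)."""
    return 2 * exp(-((n - 1) ** 2) * (1 - delta) ** 2 * t**2 / (2 * n))


delta = Fraction(3, 5)
beta = tv(kernel_row(0, delta), kernel_row(1, delta))
lip = hamming_lipschitz_constant(5)
print(beta, lip, markov_tail_bound(0.2, 100, 0.5))
```

For $n = 5$ the worst case is an interior flip that changes both neighbouring indicators at once, which is where the factor of two in the Lipschitz constant comes from.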
3.3.5 Asymmetric coupling cost


Thus far, we have considered various types of Wasserstein distances, which can be used to obtain concentration for Lipschitz functions. However, this approach, as with most methods that involve Lipschitz conditions with respect to $\ell_1$-type norms, typically does not yield dimension-independent bounds. By contrast, as we have seen previously, Lipschitz conditions based on the $\ell_2$-norm often do lead to dimension-independent results.

With this motivation in mind, this section is devoted to consideration of another type of 
coupling-based distance between probability distributions, but one that is asymmetric in its 
two arguments, and of a quadratic nature. In particular, we define 


$$C(Q, P) := \inf_{\mathbb{M}} \sqrt{\int \sum_{i=1}^n \big(\mathbb{M}[Y_i \ne x_i \mid X_i = x_i]\big)^2 \, dP(x)}, \tag{3.71}$$


where once again the infimum ranges over all couplings M of the pair (P, Q). This distance 
is relatively closely related to the total variation distance; in particular, it can be shown that 
an equivalent representation for this asymmetric distance is 


$$C(Q, P) = \sqrt{\int \Big|1 - \frac{dQ}{dP}(x)\Big|_+^2 \, dP(x)}, \tag{3.72}$$


where $t_+ := \max\{0, t\}$. We leave this equivalence as an exercise for the reader. This representation reveals the close link to the total variation distance, for which


$$\|P - Q\|_{\mathrm{TV}} = \int \Big|1 - \frac{dQ}{dP}(x)\Big| \, dP(x) = 2 \int \Big|1 - \frac{dQ}{dP}(x)\Big|_+ \, dP(x).$$

An especially interesting aspect of the asymmetric coupling distance is that it satisfies a Pinsker-type inequality for product distributions. In particular, given any product distribution $P$ in $n$ variables, we have

$$\max\{C(Q, P),\, C(P, Q)\} \le \sqrt{2 D(Q \,\|\, P)} \tag{3.73}$$


for all distributions Q in n dimensions. This deep result is due to Samson; see the biblio- 
graphic section for further discussion. While simple to state, it is non-trivial to prove, and 
has some very powerful consequences for the concentration of convex and Lipschitz func- 
tions, as summarized in the following: 


Theorem 3.24 Consider a vector of independent random variables $(X_1, \ldots, X_n)$, each taking values in $[0, 1]$, and let $f: \mathbb{R}^n \to \mathbb{R}$ be convex, and $L$-Lipschitz with respect to the Euclidean norm. Then for all $t \ge 0$, we have

$$\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \ge t\big] \le 2 e^{-\frac{t^2}{2L^2}}. \tag{3.74}$$


Remarks: Note that this is the analog of Theorem 2.26, namely a dimension-independent form of concentration for Lipschitz functions of independent Gaussian variables, but formulated for Lipschitz and convex functions of bounded random variables.

Of course, the same bound also applies to a concave and Lipschitz function. Earlier, we saw that upper tail bounds can be obtained under a slightly milder condition, namely that of
separate convexity (see Theorem 3.4). However, two-sided tail bounds (or concentration 
inequalities) require these stronger convexity or concavity conditions, as imposed here. 


Example 3.25 (Rademacher revisited) As previously introduced in Example 3.5, the Rademacher complexity of a set $\mathcal{A} \subseteq \mathbb{R}^n$ is defined in terms of the random variable

$$Z \equiv Z(\varepsilon_1, \ldots, \varepsilon_n) := \sup_{a \in \mathcal{A}} \sum_{k=1}^n a_k \varepsilon_k,$$

where $\{\varepsilon_k\}_{k=1}^n$ is an i.i.d. sequence of Rademacher variables. As shown in Example 3.5, the function $(\varepsilon_1, \ldots, \varepsilon_n) \mapsto Z(\varepsilon_1, \ldots, \varepsilon_n)$ is jointly convex, and Lipschitz with respect to the Euclidean norm with parameter $\mathcal{W}(\mathcal{A}) := \sup_{a \in \mathcal{A}} \|a\|_2$. Consequently, Theorem 3.24 implies that

$$\mathbb{P}\big[|Z - \mathbb{E}[Z]| \ge t\big] \le 2 \exp\Big(-\frac{t^2}{2\, \mathcal{W}^2(\mathcal{A})}\Big). \tag{3.75}$$

Note that this bound sharpens our earlier inequality (3.17), both in terms of the exponent and in providing a two-sided result. ♣
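To make Example 3.25 concrete, here is a minimal sketch for a finite set $\mathcal{A}$ of vectors: it computes $Z = \sup_{a \in \mathcal{A}} \langle a, \varepsilon \rangle$, the width $\mathcal{W}(\mathcal{A}) = \sup_{a \in \mathcal{A}} \|a\|_2$, and a bound of the form $2\exp(-t^2/(2\mathcal{W}^2(\mathcal{A})))$, taken here as the shape of (3.75). Restricting to a finite set, the helper names, and the simulation parameters are all assumptions made for illustration.

```python
import math
import random


def rademacher_sup(A, eps):
    """Z = max over a in A of the inner product <a, eps> (A is a finite list of vectors)."""
    return max(sum(a_k * e_k for a_k, e_k in zip(a, eps)) for a in A)


def width(A):
    """W(A): largest Euclidean norm of any vector in A."""
    return max(math.sqrt(sum(a_k * a_k for a_k in a)) for a in A)


def two_sided_bound(A, t):
    """Bound of the form 2 exp(-t^2 / (2 W(A)^2))."""
    return 2.0 * math.exp(-t * t / (2.0 * width(A) ** 2))


def simulate(A, trials=5000, seed=0):
    """Draw independent copies of Z using i.i.d. Rademacher signs."""
    rng = random.Random(seed)
    n = len(A[0])
    return [
        rademacher_sup(A, [rng.choice((-1, 1)) for _ in range(n)])
        for _ in range(trials)
    ]
```

The point of the experiment is that the spread of the simulated values of $Z$ is governed by $\mathcal{W}(\mathcal{A})$ rather than by the ambient dimension $n$.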


Let us now prove Theorem 3.24. 


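Proof As defined, any Wasserstein distance immediately yields an upper bound on a quantity of the form $\int f \,(dQ - dP)$, where $f$ is a Lipschitz function. Although the asymmetric coupling-based distance is not a Wasserstein distance, the key fact is that it can be used to upper bound such differences when $f: [0, 1]^n \to \mathbb{R}$ is Lipschitz and convex. Indeed, for a convex $f$, we have the lower bound $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle$, which implies that

$$f(y) - f(x) \le \sum_{j=1}^n \Big|\frac{\partial f}{\partial y_j}(y)\Big| \, \mathbb{I}[x_j \ne y_j].$$

Here we have also used the fact that $|x_j - y_j| \le \mathbb{I}[x_j \ne y_j]$ for variables taking values in the unit interval $[0, 1]$. Consequently, for any coupling $\mathbb{M}$ of the pair $(P, Q)$, we have

$$\begin{aligned}
\int f(y)\, dQ(y) - \int f(x)\, dP(x) &\le \sum_{j=1}^n \int \Big|\frac{\partial f}{\partial y_j}(y)\Big| \, \mathbb{I}[x_j \ne y_j] \, d\mathbb{M}(x, y) \\
&= \sum_{j=1}^n \int \Big|\frac{\partial f}{\partial y_j}(y)\Big| \, \mathbb{M}[X_j \ne y_j \mid Y_j = y_j] \, dQ(y) \\
&\le \int \|\nabla f(y)\|_2 \sqrt{\sum_{j=1}^n \mathbb{M}^2[X_j \ne y_j \mid Y_j = y_j]} \; dQ(y),
\end{aligned}$$

where we have applied the Cauchy-Schwarz inequality. By the Lipschitz condition and convexity, we have $\|\nabla f(y)\|_2 \le L$ almost everywhere, and hence

$$\begin{aligned}
\int f(y)\, dQ(y) - \int f(x)\, dP(x) &\le L \int \Big\{\sum_{j=1}^n \mathbb{M}^2[X_j \ne y_j \mid Y_j = y_j]\Big\}^{1/2} dQ(y) \\
&\le L \Big[\int \sum_{j=1}^n \mathbb{M}^2[X_j \ne y_j \mid Y_j = y_j] \, dQ(y)\Big]^{1/2}.
\end{aligned}$$

Since the coupling $\mathbb{M}$ was arbitrary, taking the infimum over all couplings shows that the left-hand side is at most $L\, C(P, Q)$.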


Consequently, the upper tail bound follows by a combination of the information inequal- 
ity (3.73) and Theorem 3.19. 

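To obtain the lower bound for a convex Lipschitz function, it suffices to establish an upper bound for a concave Lipschitz function, say $g: [0, 1]^n \to \mathbb{R}$. In this case, we have the upper bound

$$g(y) \le g(x) + \langle \nabla g(x), y - x \rangle \le g(x) + \sum_{j=1}^n \Big|\frac{\partial g(x)}{\partial x_j}\Big| \, \mathbb{I}[x_j \ne y_j],$$

and consequently

$$\int g\, dQ(y) - \int g\, dP(x) \le \sum_{j=1}^n \int \Big|\frac{\partial g(x)}{\partial x_j}\Big| \, \mathbb{I}[x_j \ne y_j] \, d\mathbb{M}(x, y).$$

The same line of reasoning then shows that $\int g\, dQ(y) - \int g\, dP(x) \le L\, C(Q, P)$, from which the claim then follows as before.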

We have stated Theorem 3.24 for the familiar case of independent random variables. How- 
ever, a version of the underlying information inequality (3.73) holds for many collections of 
random variables. In particular, consider an n-dimensional distribution P for which there 
exists some $\gamma > 0$ such that the following inequality holds:

$$\max\{C(Q, P),\, C(P, Q)\} \le \sqrt{2 \gamma D(Q \,\|\, P)} \quad \text{for all distributions } Q. \tag{3.76}$$


The same proof then shows that any $L$-Lipschitz function satisfies the concentration inequality

$$\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \ge t\big] \le 2 \exp\Big(-\frac{t^2}{2 \gamma L^2}\Big). \tag{3.77}$$

For example, for a Markov chain that satisfies the $\beta$-contraction condition (3.67), it can be shown that the information inequality (3.76) holds with $\gamma = \big(\frac{1}{1 - \sqrt{\beta}}\big)^2$. Consequently, any $L$-Lipschitz function (with respect to the Euclidean norm) of a $\beta$-contractive Markov chain satisfies the concentration inequality

$$\mathbb{P}\big[|f(X) - \mathbb{E}[f(X)]| \ge t\big] \le 2 \exp\Big(-\frac{(1 - \sqrt{\beta})^2 t^2}{2 L^2}\Big). \tag{3.78}$$

This bound is a dimension-independent analog of our earlier bound (3.69) for a contractive 
Markov chain. We refer the reader to the bibliographic section for further discussion of 
results of this type. 
3.4 Tail bounds for empirical processes


In this section, we illustrate the use of concentration inequalities in application to empirical 
processes. We encourage the interested reader to look ahead to Chapter 4 so as to acquire 
the statistical motivation for the classes of problems studied in this section. Here we use the 
entropy method to derive various tail bounds on the suprema of empirical processes—in par- 
ticular, for random variables that are generated by taking suprema of sample averages over 
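function classes. More precisely, let $\mathscr{F}$ be a class of functions (each of the form $f: \mathcal{X} \to \mathbb{R}$), and let $(X_1, \ldots, X_n)$ be drawn from a product distribution $P = \bigotimes_{i=1}^n P_i$, where each $P_i$ is supported on some set $\mathcal{X}_i \subseteq \mathcal{X}$. We then consider the random variable

$$Z = \sup_{f \in \mathscr{F}} \Big\{\frac{1}{n} \sum_{i=1}^n f(X_i)\Big\}. \tag{3.79}$$

The primary goal of this section is to derive a number of upper bounds on the tail event $\{Z \ge \mathbb{E}[Z] + \delta\}$.

As a passing remark, we note that, if the goal is to obtain bounds on the random variable $\sup_{f \in \mathscr{F}} \big|\frac{1}{n} \sum_{i=1}^n f(X_i)\big|$, then it can be reduced to an instance of the variable (3.79) by considering the augmented function class $\widetilde{\mathscr{F}} = \mathscr{F} \cup \{-\mathscr{F}\}$.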


3.4.1 A functional Hoeffding inequality 


We begin with the simplest type of tail bound for the random variable Z, namely one of the 
Hoeffding type. The following result is a generalization of the classical Hoeffding theorem 
for sums of bounded random variables. 


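Theorem 3.26 (Functional Hoeffding theorem) For each $f \in \mathscr{F}$ and $i = 1, \ldots, n$, assume that there are real numbers $a_{i,f} \le b_{i,f}$ such that $f(x) \in [a_{i,f}, b_{i,f}]$ for all $x \in \mathcal{X}_i$. Then for all $\delta \ge 0$, we have

$$\mathbb{P}\big[Z \ge \mathbb{E}[Z] + \delta\big] \le \exp\Big(-\frac{n \delta^2}{4 L^2}\Big), \tag{3.80}$$

where $L^2 := \sup_{f \in \mathscr{F}} \big\{\frac{1}{n} \sum_{i=1}^n (b_{i,f} - a_{i,f})^2\big\}$.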


Remark: In a very special case, Theorem 3.26 can be used to recover the classical Hoeffding inequality in the case of bounded random variables, albeit with a slightly worse constant. Indeed, if we let $\mathscr{F}$ be a singleton consisting of the identity function $f(x) = x$, then we have $Z = \frac{1}{n} \sum_{i=1}^n X_i$. Consequently, as long as $x_i \in [a_i, b_i]$, Theorem 3.26 implies that

$$\mathbb{P}\Big[\frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge \delta\Big] \le e^{-\frac{n \delta^2}{4 L^2}},$$

where $L^2 = \frac{1}{n} \sum_{i=1}^n (b_i - a_i)^2$. We thus recover the classical Hoeffding theorem, although the constant 1/4 in the exponent is not optimal.

Footnote to the definition (3.79): Note that there can be measurability problems associated with this definition if $\mathscr{F}$ is not countable. See the bibliographic discussion in Chapter 4 for more details on how to resolve them.

More substantive implications of Theorem 3.26 arise when it is applied to a larger function 
class F. In order to appreciate its power, let us compare the upper tail bound (3.80) to the 
corresponding bound that can be derived from the bounded differences inequality, as applied 
to the function $(x_1, \ldots, x_n) \mapsto Z(x_1, \ldots, x_n)$. With some calculation, it can be seen that this function satisfies the bounded difference inequality with constant $L_i := \sup_{f \in \mathscr{F}} |b_{i,f} - a_{i,f}|$ in coordinate $i$. Consequently, the bounded differences method (Corollary 2.21) yields a sub-Gaussian tail bound, analogous to the bound (3.80), but with the parameter

$$\widetilde{L}^2 = \frac{1}{n} \sum_{i=1}^n \sup_{f \in \mathscr{F}} (b_{i,f} - a_{i,f})^2.$$

Note that the quantity $\widetilde{L}$ (since it is defined by applying the supremum separately to each coordinate) can be substantially larger than the constant $L$ defined in the theorem statement.
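The gap between the two constants can be seen in a small computation. The sketch below represents a finite function class by its ranges, with `ranges[f][i] = (a, b)` standing for $(a_{i,f}, b_{i,f})$, and evaluates both $L^2 = \sup_f \frac{1}{n}\sum_i (b_{i,f} - a_{i,f})^2$ and the bounded-differences parameter $\frac{1}{n}\sum_i \sup_f (b_{i,f} - a_{i,f})^2$. The data layout, the function names and the toy class are hypothetical, chosen only for illustration.

```python
def hoeffding_L2(ranges):
    """L^2: supremum over functions of the average squared range (sup outside the sum)."""
    n = len(ranges[0])
    return max(sum((b - a) ** 2 for a, b in per_f) for per_f in ranges) / n


def bounded_diff_L2(ranges):
    """Bounded-differences parameter: average over coordinates of the worst squared range."""
    n = len(ranges[0])
    return sum(
        max((per_f[i][1] - per_f[i][0]) ** 2 for per_f in ranges) for i in range(n)
    ) / n


# Toy class (assumption): function j has range [0, 1] on coordinate j and is
# identically zero on every other coordinate.
n = 4
toy = [[(0.0, 1.0) if i == j else (0.0, 0.0) for i in range(n)] for j in range(n)]
print(hoeffding_L2(toy), bounded_diff_L2(toy))
```

For this toy class the first quantity scales as $1/n$ while the second stays at $1$, which is the kind of separation the comparison in the text refers to.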


Proof It suffices to prove the result for a finite class of functions F; the general result can 
be recovered by taking limits over an increasing sequence of such finite classes. Let us view 
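$Z$ as a function of the random variables $(X_1, \ldots, X_n)$. For each index $j = 1, \ldots, n$, define the random function

$$x_j \mapsto Z_j(x_j) = Z(X_1, \ldots, X_{j-1}, x_j, X_{j+1}, \ldots, X_n).$$

In order to avoid notational clutter, we work throughout this proof with the unrescaled version of $Z$, namely $Z = \sup_{f \in \mathscr{F}} \sum_{i=1}^n f(X_i)$. Combining the tensorization Lemma 3.8 with the bound (3.20a) from Lemma 3.7, we obtain

$$\mathbb{H}(e^{\lambda Z(X)}) \le \lambda^2\, \mathbb{E}\Big[\sum_{j=1}^n \mathbb{E}\big[(Z_j(X_j) - Z_j(Y_j))^2\, \mathbb{I}[Z_j(X_j) \ge Z_j(Y_j)]\, e^{\lambda Z(X)} \mid X^{\setminus j}\big]\Big]. \tag{3.81}$$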


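For each $f \in \mathscr{F}$, define the set $\mathcal{A}(f) := \{(x_1, \ldots, x_n) \in \mathbb{R}^n \mid Z = \sum_{i=1}^n f(x_i)\}$, corresponding to the set of realizations for which the maximum defining $Z$ is achieved by $f$. (If there are ties, then we resolve them arbitrarily so as to make the sets $\mathcal{A}(f)$ disjoint.) For any $x \in \mathcal{A}(f)$, we have

$$Z_j(x_j) - Z_j(y_j) = f(x_j) + \sum_{i \ne j} f(x_i) - \max_{\tilde{f} \in \mathscr{F}} \Big\{\tilde{f}(y_j) + \sum_{i \ne j} \tilde{f}(x_i)\Big\} \le f(x_j) - f(y_j).$$

As long as $Z_j(x_j) \ge Z_j(y_j)$, this inequality still holds after squaring both sides. Considering all possible sets $\mathcal{A}(f)$, we arrive at the upper bound

$$(Z_j(x_j) - Z_j(y_j))^2\, \mathbb{I}[Z_j(x_j) \ge Z_j(y_j)] \le \sum_{f \in \mathscr{F}} \mathbb{I}[x \in \mathcal{A}(f)]\, (f(x_j) - f(y_j))^2. \tag{3.82}$$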
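Since $(f(x_j) - f(y_j))^2 \le (b_{j,f} - a_{j,f})^2$ by assumption, summing over the indices $j$ yields

$$\begin{aligned}
\sum_{j=1}^n (Z_j(x_j) - Z_j(y_j))^2\, \mathbb{I}[Z_j(x_j) \ge Z_j(y_j)] &\le \sum_{h \in \mathscr{F}} \mathbb{I}[x \in \mathcal{A}(h)] \sum_{k=1}^n (b_{k,h} - a_{k,h})^2 \\
&\le \sup_{f \in \mathscr{F}} \sum_{j=1}^n (b_{j,f} - a_{j,f})^2 \\
&= n L^2.
\end{aligned}$$

Substituting back into our earlier inequality (3.81), we find that

$$\mathbb{H}(e^{\lambda Z(X)}) \le n L^2 \lambda^2\, \mathbb{E}[e^{\lambda Z(X)}].$$

This is a sub-Gaussian entropy bound (3.5) with $\sigma = \sqrt{2n}\, L$, so that Proposition 3.2 implies that the unrescaled version of $Z$ satisfies the tail bound

$$\mathbb{P}\big[Z \ge \mathbb{E}[Z] + t\big] \le e^{-\frac{t^2}{4 n L^2}}.$$

Setting $t = n\delta$ yields the claim (3.80) for the rescaled version of $Z$.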


3.4.2 A functional Bernstein inequality 


In this section, we turn to the Bernstein refinement of the functional Hoeffding inequality 
from Theorem 3.26. As opposed to control only in terms of bounds on the function values, 
it also brings a notion of variance into play. As will be discussed at length in later chapters, 
this type of variance control plays a key role in obtaining sharp bounds for various types of 
statistical estimators. 


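Theorem 3.27 (Talagrand concentration for empirical processes) Consider a countable class of functions $\mathscr{F}$ uniformly bounded by $b$. Then for all $\delta > 0$, the random variable (3.79) satisfies the upper tail bound

$$\mathbb{P}\big[Z \ge \mathbb{E}[Z] + \delta\big] \le 2 \exp\Big(\frac{-n \delta^2}{8 e\, \mathbb{E}[\Sigma^2] + 4 b \delta}\Big), \tag{3.83}$$

where $\Sigma^2 = \sup_{f \in \mathscr{F}} \frac{1}{n} \sum_{i=1}^n f^2(X_i)$.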


In order to obtain a simpler bound, the expectation $\mathbb{E}[\Sigma^2]$ can be upper bounded. Using symmetrization techniques to be developed in Chapter 4, it can be shown that

$$\mathbb{E}[\Sigma^2] \le \sigma^2 + 2 b\, \mathbb{E}[Z], \tag{3.84}$$

where $\sigma^2 = \sup_{f \in \mathscr{F}} \mathbb{E}[f^2(X)]$. Using this upper bound on $\mathbb{E}[\Sigma^2]$ and performing some algebra, we obtain that there are universal positive constants $(c_0, c_1)$ such that

$$\mathbb{P}\Big[Z \ge \mathbb{E}[Z] + c_0 \gamma \sqrt{\frac{t}{n}} + c_1 \frac{b t}{n}\Big] \le e^{-t} \quad \text{for all } t > 0, \tag{3.85}$$

where $\gamma^2 = \sigma^2 + 2 b\, \mathbb{E}[Z]$. See Exercise 3.16 for the derivation of this inequality from Theorem 3.27 and the upper bound (3.84). Although the proof outlined here leads to poor constants, the best known are $c_0 = \sqrt{2}$ and $c_1 = 1/3$; see the bibliographic section for further details.

In certain settings, it can be useful to exploit the bound (3.85) in an alternative form: in particular, for any $\epsilon > 0$, it implies the upper bound

$$\mathbb{P}\Big[Z \ge (1 + \epsilon)\, \mathbb{E}[Z] + c_0 \sigma \sqrt{\frac{t}{n}} + \Big(c_1 + \frac{c_0^2}{\epsilon}\Big) \frac{b t}{n}\Big] \le e^{-t}. \tag{3.86}$$

Conversely, we can recover the tail bound (3.85) by optimizing over $\epsilon > 0$ in the family of bounds (3.86); see Exercise 3.16 for the details of this equivalence.
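To see what the variance term buys, the following sketch evaluates the deviation level $\mathbb{E}[Z] + c_0 \gamma \sqrt{t/n} + c_1 b t / n$ with $\gamma^2 = \sigma^2 + 2b\,\mathbb{E}[Z]$, using the constants $c_0 = \sqrt{2}$ and $c_1 = 1/3$ quoted in the text, and compares it with the level obtained by solving $\exp(-n\delta^2/(4L^2)) = e^{-t}$ in the functional Hoeffding bound. The comparison assumes functions taking values in $[-b, b]$, so that $L^2 = 4b^2$; that assumption, the function names, and the inputs in the usage note are illustrative only.

```python
import math


def bernstein_level(mean_z, sigma2, b, n, t, c0=math.sqrt(2.0), c1=1.0 / 3.0):
    """Level exceeded with probability at most e^{-t} under a bound of the form (3.85)."""
    gamma = math.sqrt(sigma2 + 2.0 * b * mean_z)
    return mean_z + c0 * gamma * math.sqrt(t / n) + c1 * b * t / n


def hoeffding_level(mean_z, b, n, t):
    """Level from exp(-n d^2 / (4 L^2)) = e^{-t} with L^2 = 4 b^2 (values in [-b, b])."""
    return mean_z + 4.0 * b * math.sqrt(t / n)
```

For instance, with `mean_z=0.01`, `sigma2=0.01`, `b=1.0`, `n=10000` and `t=5.0`, the Bernstein-type level is noticeably smaller than the Hoeffding-type level, reflecting the benefit of a small variance proxy $\sigma^2$.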


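Proof We assume without loss of generality that $b = 1$, since the general case can be reduced to this one. Moreover, as in the proof of Theorem 3.26, we work with the unrescaled version, namely the variable $Z = \sup_{f \in \mathscr{F}} \sum_{i=1}^n f(X_i)$, and then translate our results back. Recall the definition of the sets $\mathcal{A}(f)$, and the upper bound (3.82) from the previous proof; substituting it into the entropy bound (3.81) yields the upper bound

$$\mathbb{H}(e^{\lambda Z}) \le \lambda^2\, \mathbb{E}\Big[\sum_{j=1}^n \mathbb{E}\Big[\sum_{f \in \mathscr{F}} \mathbb{I}[X \in \mathcal{A}(f)]\, (f(X_j) - f(Y_j))^2\, e^{\lambda Z} \mid X^{\setminus j}\Big]\Big].$$

Now we have

$$\begin{aligned}
\sum_{i=1}^n \sum_{f \in \mathscr{F}} \mathbb{I}[X \in \mathcal{A}(f)]\, (f(X_i) - f(Y_i))^2 &\le 2 \sup_{f \in \mathscr{F}} \sum_{i=1}^n f^2(X_i) + 2 \sup_{f \in \mathscr{F}} \sum_{i=1}^n f^2(Y_i) \\
&= 2\{\Gamma(X) + \Gamma(Y)\},
\end{aligned}$$

where $\Gamma(X) := \sup_{f \in \mathscr{F}} \sum_{i=1}^n f^2(X_i)$ is the unrescaled version of $\Sigma^2$. Combined with our earlier inequality, we see that the entropy satisfies the upper bound

$$\mathbb{H}(e^{\lambda Z}) \le 2 \lambda^2 \big\{\mathbb{E}[\Gamma e^{\lambda Z}] + \mathbb{E}[\Gamma]\, \mathbb{E}[e^{\lambda Z}]\big\}. \tag{3.87}$$

From the result of Exercise 3.4, we have $\mathbb{H}(e^{\lambda (Z + c)}) = e^{\lambda c}\, \mathbb{H}(e^{\lambda Z})$ for any constant $c \in \mathbb{R}$. Since the right-hand side also contains a term $e^{\lambda Z}$ in each component, we see that the same upper bound holds for $\mathbb{H}(e^{\lambda \widetilde{Z}})$, where $\widetilde{Z} = Z - \mathbb{E}[Z]$ is the centered version. We now introduce a lemma to control the term $\mathbb{E}[\Gamma e^{\lambda \widetilde{Z}}]$.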


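Lemma 3.28 (Controlling the random variance) For all $\lambda > 0$, we have

$$\mathbb{E}[\Gamma e^{\lambda \widetilde{Z}}] \le (e - 1)\, \mathbb{E}[\Gamma]\, \mathbb{E}[e^{\lambda \widetilde{Z}}] + \mathbb{E}[\widetilde{Z} e^{\lambda \widetilde{Z}}]. \tag{3.88}$$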


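Combining the upper bound (3.88) with the entropy upper bound (3.87) for $\widetilde{Z}$, we obtain

$$\mathbb{H}(e^{\lambda \widetilde{Z}}) \le \lambda^2 \big\{2 e\, \mathbb{E}[\Gamma]\, \varphi(\lambda) + 2 \varphi'(\lambda)\big\} \quad \text{for all } \lambda > 0,$$

where $\varphi(\lambda) := \mathbb{E}[e^{\lambda \widetilde{Z}}]$ is the moment generating function of $\widetilde{Z}$. Since $\mathbb{E}[\widetilde{Z}] = 0$, we recognize this as an entropy bound of the Bernstein form (3.10) with $b = 2$ and $\sigma^2 = 2 e\, \mathbb{E}[\Gamma]$. Consequently, by the consequence (3.12) stated following Proposition 3.3, we conclude that

$$\mathbb{P}\big[\widetilde{Z} \ge \delta\big] = \mathbb{P}\big[Z \ge \mathbb{E}[Z] + \delta\big] \le \exp\Big(-\frac{\delta^2}{8 e\, \mathbb{E}[\Gamma] + 4 \delta}\Big) \quad \text{for all } \delta > 0.$$

Recalling the definition of $\Gamma$ and rescaling by $1/n$, we obtain the stated claim of the theorem with $b = 1$.

It remains to prove Lemma 3.28. Consider the function $\phi(t) = e^t$ with conjugate dual $\phi^*(s) = s \log s - s$ for $s > 0$. By the definition of conjugate duality (also known as Young's inequality), we have $st \le s \log s - s + e^t$ for all $s > 0$ and $t \in \mathbb{R}$. Applying this inequality with $s = e^{\lambda \widetilde{Z}}$ and $t = \Gamma - (e - 1)\, \mathbb{E}[\Gamma]$ and then taking expectations, we find that

$$\mathbb{E}[\Gamma e^{\lambda \widetilde{Z}}] - (e - 1)\, \mathbb{E}[e^{\lambda \widetilde{Z}}]\, \mathbb{E}[\Gamma] \le \lambda\, \mathbb{E}[\widetilde{Z} e^{\lambda \widetilde{Z}}] - \mathbb{E}[e^{\lambda \widetilde{Z}}] + \mathbb{E}[e^{\Gamma - (e - 1) \mathbb{E}[\Gamma]}].$$

Note that $\Gamma$ is defined as a supremum of a class of functions taking values in $[0, 1]$. Therefore, by the result of Exercise 3.15, we have $\mathbb{E}[e^{\Gamma - (e - 1) \mathbb{E}[\Gamma]}] \le 1$. Moreover, by Jensen's inequality, we have $\mathbb{E}[e^{\lambda \widetilde{Z}}] \ge e^{\lambda \mathbb{E}[\widetilde{Z}]} = 1$. Putting together the pieces yields the claim (3.88).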


3.5 Bibliographic details and background 


Concentration of measure is an extremely rich and deep area with an extensive literature; we 
refer the reader to the books by Ledoux (2001) and Boucheron et al. (2013) for more com- 
prehensive treatments. Logarithmic Sobolev inequalities were introduced by Gross (1975) 
in a functional-analytic context. Their dimension-free nature makes them especially well 
suited for controlling infinite-dimensional stochastic processes (e.g., Holley and Stroock, 
1987). The argument underlying the proof of Proposition 3.2 is based on the unpublished 
notes of Herbst. Ledoux (1996; 2001) pioneered the entropy method in application to a 
wider range of problems. The proof of Theorem 3.4 is based on Ledoux (1996), whereas the 
proofs of Lemmas 3.7 and 3.8 follow the book (Ledoux, 2001). A result of the form in The- 
orem 3.4 was initially proved by Talagrand (1991; 1995; 1996b) using his convex distance 
inequalities. 

The Brunn—Minkowski theorem is a classical result from geometry and real analysis; 
see Gardner (2002) for a survey of its history and connections. Theorem 3.15 was proved 
independently by Prékopa (1971; 1973) and Leindler (1972). Brascamp and Lieb (1976) 
developed various connections between log-concavity and log-Sobolev inequalities; see the 
paper by Bobkov (1999) for further discussion. The inf-convolution argument underlying the 
proof of Theorem 3.16 was initiated by Maurey (1991), and further developed by Bobkov 
and Ledoux (2000). The lecture notes by Ball (1997) contain a wealth of information on 
geometric aspects of concentration, including spherical sections of convex bodies. Harper’s 
theorem quoted in Example 3.13 is proven in the paper (Harper, 1966); it is a special case 
of a more general class of results known as discrete isoperimetric inequalities. 

The Kantorovich—Rubinstein duality (3.55) was established by Kantorovich and Rubin- 
stein (1958); it is a special case of more general results in optimal transport theory (e.g., 
Villani, 2008; Rachev and Ruschendorf, 1998). Marton (1996a) pioneered the use of the 
transportation cost method for deriving concentration inequalities, with subsequent contri- 
butions from various researchers (e.g., Dembo and Zeitouni, 1996; Dembo, 1997; Bobkov 
and Götze, 1999; Ledoux, 2001). See Marton’s paper (1996b) for a proof of Theorem 3.22. 
The information inequality (3.73) was proved by Samson (2000). As noted following the
statement of Theorem 3.24, he actually proves a much more general result, applicable to 
various types of dependent random variables. Other results on concentration for dependent 
random variables include the papers (Marton, 2004; Kontorovich and Ramanan, 2008). 

Upper tail bounds on the suprema of empirical processes can be proved using chaining 
methods; see Chapter 5 for more details. Talagrand (1996a) initiated the use of concentration 
techniques to control deviations above the mean, as in Theorems 3.26 and 3.27. The theo- 
rems and entropy-based arguments given here are based on Chapter 7 of Ledoux (2001); 
the sketch in Exercise 3.15 is adapted from arguments in the same chapter. Sharper forms 
of Theorem 3.27 have been established by various authors (e.g., Massart, 2000; Bous- 
quet, 2002, 2003; Klein and Rio, 2005). In particular, Bousquet (2003) proved that the 
bound (3.85) holds with constants $c_0 = \sqrt{2}$ and $c_1 = 1/3$. There are also various results on concentration of empirical processes for unbounded and/or dependent random variables (e.g., Adamczak, 2008; Mendelson, 2010); see also Chapter 14 for some one-sided
results in this direction. 


3.6 Exercises 


Exercise 3.1 (Shannon entropy and Kullback-Leibler divergence) Given a discrete random variable $X \in \mathcal{X}$ with probability mass function $p$, its Shannon entropy is given by $H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x)$. In this exercise, we explore the connection between the entropy functional $\mathbb{H}$ based on $\phi(u) = u \log u$ (see equation (3.2)) and the Shannon entropy.

(a) Consider the random variable $Z = p(U)$, where $U$ is uniformly distributed over $\mathcal{X}$. Show that

$$\mathbb{H}(Z) = \frac{1}{|\mathcal{X}|} \{\log |\mathcal{X}| - H(X)\}.$$

(b) Use part (a) to show that Shannon entropy for a discrete random variable is maximized by a uniform distribution.

(c) Given two probability mass functions $p$ and $q$, specify a choice of random variable $Y$ such that $\mathbb{H}(Y) = D(p \,\|\, q)$, corresponding to the Kullback-Leibler divergence between $p$ and $q$.


Exercise 3.2 (Chain rule and Kullback-Leibler divergence) Given two $n$-variate distributions $Q$ and $P$, show that the Kullback-Leibler divergence can be decomposed as

$$D(Q \,\|\, P) = D(Q_1 \,\|\, P_1) + \sum_{j=2}^n \mathbb{E}_{Q_1^{j-1}}\big[D\big(Q_j(\cdot \mid X_1^{j-1}) \,\|\, P_j(\cdot \mid X_1^{j-1})\big)\big],$$

where $Q_j(\cdot \mid X_1^{j-1})$ denotes the conditional distribution of $X_j$ given $(X_1, \ldots, X_{j-1})$ under $Q$, with a similar definition for $P_j(\cdot \mid X_1^{j-1})$.


Exercise 3.3 (Variational representation for entropy) Show that the entropy has the variational representation

$$\mathbb{H}(e^{\lambda X}) = \inf_{t \in \mathbb{R}} \mathbb{E}\big[\psi(\lambda (X - t))\, e^{\lambda X}\big], \tag{3.89}$$

where $\psi(u) := e^{-u} - 1 + u$.

Exercise 3.4 (Entropy and constant shifts) In this exercise, we explore some properties of the entropy.

(a) Show that for any random variable $X$ and constant $c \in \mathbb{R}$,

$$\mathbb{H}(e^{\lambda (X + c)}) = e^{\lambda c}\, \mathbb{H}(e^{\lambda X}).$$

(b) Use part (a) to show that, if $X$ satisfies the entropy bound (3.5), then so does $X + c$ for any constant $c$.


Exercise 3.5 (Equivalent forms of entropy) Let $\mathbb{H}_\varphi$ denote the entropy defined by the convex function $\varphi(u) = u \log u - u$. Show that $\mathbb{H}_\varphi(e^{\lambda X}) = \mathbb{H}(e^{\lambda X})$, where $\mathbb{H}$ denotes the usual entropy (defined by $\phi(u) = u \log u$).


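Exercise 3.6 (Entropy rescaling) In this problem, we develop recentering and rescaling arguments used in the proof of Proposition 3.3.

(a) Show that a random variable $X$ satisfies the Bernstein entropy bound (3.10) if and only if $\widetilde{X} = X - \mathbb{E}[X]$ satisfies the inequality

$$\mathbb{H}(e^{\lambda \widetilde{X}}) \le \lambda^2 \big\{b\, \varphi'_{\widetilde{X}}(\lambda) + \varphi_{\widetilde{X}}(\lambda)\, \sigma^2\big\} \quad \text{for all } \lambda \in [0, 1/b). \tag{3.90}$$

(b) Show that a zero-mean random variable $X$ satisfies inequality (3.90) if and only if $\widetilde{X} = X/b$ satisfies the bound

$$\mathbb{H}(e^{\lambda \widetilde{X}}) \le \lambda^2 \big\{\varphi'_{\widetilde{X}}(\lambda) + \widetilde{\sigma}^2 \varphi_{\widetilde{X}}(\lambda)\big\} \quad \text{for all } \lambda \in [0, 1),$$

where $\widetilde{\sigma}^2 = \sigma^2 / b^2$.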


Exercise 3.7 (Entropy for bounded variables) Consider a zero-mean random variable $X$ taking values in a finite interval $[a, b]$ almost surely. Show that its entropy satisfies the bound $\mathbb{H}(e^{\lambda X}) \le \frac{\lambda^2 \sigma^2}{2} \varphi_X(\lambda)$ with $\sigma := (b - a)/2$. (Hint: You may find the result of Exercise 3.3 useful.)


Exercise 3.8 (Exponential families and entropy) Consider a random variable $Y \in \mathcal{Y}$ with an exponential family distribution of the form

$$p_\theta(y) = h(y)\, e^{\langle \theta, T(y) \rangle - \Phi(\theta)},$$

where $T: \mathcal{Y} \to \mathbb{R}^d$ defines the vector of sufficient statistics, the function $h$ is fixed, and the density $p_\theta$ is taken with respect to base measure $\mu$. Assume that the log normalization term $\Phi(\theta) = \log \int_{\mathcal{Y}} \exp(\langle \theta, T(y) \rangle)\, h(y)\, \mu(dy)$ is finite for all $\theta \in \mathbb{R}^d$, and suppose moreover that $\nabla \Phi$ is Lipschitz with parameter $L$, meaning that

$$\|\nabla \Phi(\theta) - \nabla \Phi(\theta')\|_2 \le L\, \|\theta - \theta'\|_2 \quad \text{for all } \theta, \theta' \in \mathbb{R}^d. \tag{3.91}$$

(a) For fixed unit-norm vector $v \in \mathbb{R}^d$, consider the random variable $X = \langle v, T(Y) \rangle$. Show that

$$\mathbb{H}(e^{\lambda X}) \le L \lambda^2 \varphi_X(\lambda) \quad \text{for all } \lambda \in \mathbb{R}.$$

Conclude that $X$ is sub-Gaussian with parameter $\sqrt{2L}$.

(b) Apply part (a) to establish the sub-Gaussian property for:

(i) the univariate Gaussian distribution $Y \sim N(\mu, \sigma^2)$ (Hint: Viewing $\sigma^2$ as fixed, write it as a one-dimensional exponential family.)

(ii) the Bernoulli variable $Y \in \{0, 1\}$ with $\theta = \frac{1}{2} \log \frac{\mathbb{P}[Y = 1]}{\mathbb{P}[Y = 0]}$.


Exercise 3.9 (Another variational representation) Prove the following variational representation:

$$\mathbb{H}(e^{\lambda f(X)}) = \sup_{g} \big\{\mathbb{E}[g(X)\, e^{\lambda f(X)}] \;\big|\; \mathbb{E}[e^{g(X)}] \le 1\big\},$$

where the supremum ranges over all measurable functions. Exhibit a function $g$ at which the supremum is obtained. (Hint: The result of Exercise 3.5 and the notion of conjugate duality could be useful.)


Exercise 3.10 (Brunn-Minkowski and classical isoperimetric inequality) In this exercise, we explore the connection between the Brunn-Minkowski (BM) inequality and the classical isoperimetric inequality.

(a) Show that the BM inequality (3.43) holds if and only if

$$\mathrm{vol}(A + B)^{1/n} \ge \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n} \tag{3.92}$$

for all convex bodies $A$ and $B$.

(b) Show that the BM inequality (3.43) implies the "weaker" inequality (3.45).

(c) Conversely, show that inequality (3.45) also implies the original BM inequality (3.43). (Hint: From part (a), it suffices to prove the inequality (3.92) for bodies $A$ and $B$ with strictly positive volumes. Consider applying inequality (3.45) to the rescaled bodies $C := \frac{A}{\mathrm{vol}(A)^{1/n}}$ and $D := \frac{B}{\mathrm{vol}(B)^{1/n}}$, and a suitable choice of $\lambda$.)

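Exercise 3.11 (Concentration on the Euclidean ball) Consider the uniform measure $\mathbb{P}$ over the Euclidean unit ball $\mathbb{B}_2^n = \{x \in \mathbb{R}^n \mid \|x\|_2 \le 1\}$. In this exercise, we bound its concentration function using the Brunn-Minkowski inequality (3.45).

(a) Given any subset $A \subseteq \mathbb{B}_2^n$, show that

$$\frac{1}{2} \|a + b\|_2 \le 1 - \frac{\epsilon^2}{8} \quad \text{for all } a \in A \text{ and } b \in (A^\epsilon)^c.$$

To be clear, here we define $(A^\epsilon)^c := \mathbb{B}_2^n \setminus A^\epsilon$.

(b) Use the BM inequality (3.45) to show that $\mathbb{P}[A]\,(1 - \mathbb{P}[A^\epsilon]) \le \big(1 - \frac{\epsilon^2}{8}\big)^{2n}$.

(c) Conclude that

$$\alpha_{\mathbb{P}, (\mathcal{X}, \rho)}(\epsilon) \le 2 e^{-n \epsilon^2 / 4} \quad \text{for } \mathcal{X} = \mathbb{B}_2^n \text{ with } \rho(\cdot) = \|\cdot\|_2.$$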
Exercise 3.12 (Rademacher chaos variables) A symmetric positive semidefinite matrix $Q \in \mathcal{S}_+^{d \times d}$ can be used to define a Rademacher chaos variable $X = \sum_{i,j=1}^d Q_{ij} \varepsilon_i \varepsilon_j$, where $\{\varepsilon_i\}_{i=1}^d$ are i.i.d. Rademacher variables.

(a) Prove that

$$\mathbb{P}\big[X \ge (\sqrt{\mathrm{trace}\, Q} + t)^2\big] \le 2 \exp\Big(-\frac{t^2}{16\, |||Q|||_2}\Big).$$

(b) Given an arbitrary symmetric matrix $M \in \mathcal{S}^{d \times d}$, consider the decoupled Rademacher chaos variable $Y = \sum_{i,j=1}^d M_{ij} \varepsilon_i \varepsilon'_j$, where $\{\varepsilon'_j\}_{j=1}^d$ is a second i.i.d. Rademacher sequence, independent of the first. Show that

$$\mathbb{P}[Y \ge \delta] \le 2 \exp\Big(-\frac{\delta^2}{4\, |||M|||_F^2 + 16\, \delta\, |||M|||_2}\Big). \tag{3.93}$$

(Hint: Part (a) could be useful in an intermediate step.)


Exercise 3.13 (Total variation and Wasserstein) Consider the Wasserstein distance based on the Hamming metric, namely $W_\rho(P, Q) = \inf_{\mathbb{M}} \mathbb{M}[X \ne Y]$, where the infimum is taken over all couplings $\mathbb{M}$, that is, distributions on the product space $\mathcal{X} \times \mathcal{X}$ with marginals $P$ and $Q$, respectively. Show that

$$\inf_{\mathbb{M}} \mathbb{M}[X \ne Y] = \|P - Q\|_{\mathrm{TV}} = \sup_{A} |P(A) - Q(A)|,$$

where the supremum ranges over all measurable subsets $A$ of $\mathcal{X}$.


Exercise 3.14 (Alternative proof) In this exercise, we work through an alternative proof of Proposition 3.20. As noted, it suffices to consider the case $n = 2$. Let $P = P_1 \otimes P_2$ be a product distribution, and let $Q$ be an arbitrary distribution on $\mathcal{X} \times \mathcal{X}$.

(a) Show that the Wasserstein distance $W_\rho(Q, P)$ is upper bounded by

$$\sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big\{\int \Big[\int f(x_1, x_2)\, (dQ_{2|1} - dP_2)\Big] dQ_1 + \int \Big[\int f(x_1, x_2)\, dP_2\Big] (dQ_1 - dP_1)\Big\},$$

where the supremum ranges over all functions that are 1-Lipschitz with respect to the metric $\rho(x, x') = \sum_{i=1}^2 \rho_i(x_i, x'_i)$.

(b) Use part (a) to show that

$$W_\rho(Q, P) \le \Big[\int \sqrt{2 \gamma_2 D(Q_{2|1} \,\|\, P_2)}\; dQ_1\Big] + \sqrt{2 \gamma_1 D(Q_1 \,\|\, P_1)}.$$

(c) Complete the proof using part (b). (Hint: Cauchy-Schwarz and Exercise 3.2 could be useful.)


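Exercise 3.15 (Bounds for suprema of non-negative functions) Consider a random variable of the form $Z = \sup_{f \in \mathscr{F}} \sum_{i=1}^n f(V_i)$ where $\{V_i\}_{i=1}^n$ is an i.i.d. sequence of random variables, and $\mathscr{F}$ is a class of functions taking values in the interval $[0, 1]$. In this exercise, we prove that

$$\log \mathbb{E}[e^{\lambda Z}] \le (e^\lambda - 1)\, \mathbb{E}[Z] \quad \text{for any } \lambda \ge 0. \tag{3.94}$$

As in our main development, we can reduce the problem to a finite class of functions $\mathscr{F}$, say with $M$ functions $\{f^1, \ldots, f^M\}$. Defining the random vectors $X_i = (f^1(V_i), \ldots, f^M(V_i)) \in \mathbb{R}^M$ for $i = 1, \ldots, n$, we can write $Z(X) = \max_{\ell = 1, \ldots, M} \sum_{i=1}^n X_i^\ell$, and we let $Z_k$ denote the function $X_k \mapsto Z(X)$ with all other $X_i$ for $i \ne k$ fixed.

(a) Define $Y_k(X) := (X_1, \ldots, X_{k-1}, 0, X_{k+1}, \ldots, X_n)$. Explain why $Z(X) - Z(Y_k(X)) \ge 0$.

(b) Use the tensorization approach and the variational representation from Exercise 3.3 to show that

$$\mathbb{H}(e^{\lambda Z(X)}) \le \mathbb{E}\Big[\sum_{k=1}^n \mathbb{E}\big[\psi\big(\lambda (Z(X) - Z(Y_k(X)))\big)\, e^{\lambda Z(X)} \mid X^{\setminus k}\big]\Big] \quad \text{for all } \lambda > 0.$$

(c) For each $\ell = 1, \ldots, M$, let

$$\mathcal{A}_\ell := \Big\{x = (x_1, \ldots, x_n) \in \mathbb{R}^{M \times n} \;\Big|\; \sum_{i=1}^n x_i^\ell = \max_{j = 1, \ldots, M} \sum_{i=1}^n x_i^j\Big\}.$$

Prove that

$$0 \le \lambda \{Z(X) - Z(Y_k(X))\} \le \lambda \sum_{\ell=1}^M \mathbb{I}[X \in \mathcal{A}_\ell]\, X_k^\ell \quad \text{valid for all } \lambda \ge 0.$$

(d) Noting that $\psi(t) = e^{-t} - 1 + t$ is non-negative with $\psi(0) = 0$, argue by the convexity of $\psi$ that

$$\psi\big(\lambda (Z(X) - Z(Y_k(X)))\big) \le \psi(\lambda) \Big[\sum_{\ell=1}^M \mathbb{I}[X \in \mathcal{A}_\ell]\, X_k^\ell\Big] \quad \text{for all } \lambda \ge 0.$$

(e) Combining with previous parts, prove that

$$\mathbb{H}(e^{\lambda Z}) \le \psi(\lambda) \sum_{k=1}^n \mathbb{E}\Big[\sum_{\ell=1}^M \mathbb{I}[X \in \mathcal{A}_\ell]\, X_k^\ell\, e^{\lambda Z(X)}\Big] = \psi(\lambda)\, \mathbb{E}[Z(X)\, e^{\lambda Z(X)}].$$

(Hint: Observe that $\sum_{k=1}^n \sum_{\ell=1}^M \mathbb{I}[X \in \mathcal{A}_\ell]\, X_k^\ell = Z(X)$ by definition of the sets $\mathcal{A}_\ell$.)

(f) Use part (e) to show that $\varphi_Z(\lambda) = \mathbb{E}[e^{\lambda Z}]$ satisfies the differential inequality

$$[\log \varphi_Z(\lambda)]' \le \frac{e^\lambda}{e^\lambda - 1} \log \varphi_Z(\lambda) \quad \text{for all } \lambda > 0,$$

and use this to complete the proof.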
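Exercise 3.16 (Different forms of functional Bernstein) Consider a random variable $Z$ that satisfies a Bernstein tail bound of the form

$$\mathbb{P}\big[Z \ge \mathbb{E}[Z] + \delta\big] \le \exp\Big(-\frac{n \delta^2}{c_1 \gamma^2 + c_2 b \delta}\Big) \quad \text{for all } \delta \ge 0,$$

where $c_1$ and $c_2$ are universal constants.

(a) Show that

$$\mathbb{P}\Big[Z \ge \mathbb{E}[Z] + \gamma \sqrt{\frac{c_1 t}{n}} + \frac{c_2 b t}{n}\Big] \le e^{-t} \quad \text{for all } t \ge 0. \tag{3.95a}$$

(b) If, in addition, $\gamma^2 \le \sigma^2 + c_3 b\, \mathbb{E}[Z]$, we have

$$\mathbb{P}\Big[Z \ge (1 + \epsilon)\, \mathbb{E}[Z] + \sigma \sqrt{\frac{c_1 t}{n}} + \Big(c_2 + \frac{c_1 c_3}{2 \epsilon}\Big) \frac{b t}{n}\Big] \le e^{-t} \quad \text{for all } t \ge 0 \text{ and } \epsilon > 0. \tag{3.95b}$$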

4 


Uniform laws of large numbers 


The focus of this chapter is a class of results known as uniform laws of large numbers. As 
suggested by their name, these results represent a strengthening of the usual law of large 
numbers, which applies to a fixed sequence of random variables, to related laws that hold 
uniformly over collections of random variables. On one hand, such uniform laws are of the- 
oretical interest in their own right, and represent an entry point to a rich area of probability 
and statistics known as empirical process theory. On the other hand, uniform laws also play 
a key role in more applied settings, including understanding the behavior of different types 
of statistical estimators. The classical versions of uniform laws are of an asymptotic nature, 
whereas more recent work in the area has emphasized non-asymptotic results. Consistent 
with the overall goals of this book, this chapter will follow the non-asymptotic route, pre- 
senting results that apply to all sample sizes. In order to do so, we make use of the tail 
bounds and the notion of Rademacher complexity previously introduced in Chapter 2. 


4.1 Motivation 


We begin with some statistical motivations for deriving laws of large numbers, first for the 
case of cumulative distribution functions and then for more general function classes. 


4.1.1 Uniform convergence of cumulative distribution functions 


The law of any scalar random variable $X$ can be fully specified by its cumulative distribution function (CDF), whose value at any point $t \in \mathbb{R}$ is given by $F(t) := \mathbb{P}[X \le t]$. Now suppose that we are given a collection $\{X_i\}_{i=1}^n$ of $n$ i.i.d. samples, each drawn according to the law specified by $F$. A natural estimate of $F$ is the empirical CDF given by

$$\widehat{F}_n(t) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{(-\infty, t]}[X_i], \tag{4.1}$$
i=1 


where $\mathbb{I}_{(-\infty, t]}[x]$ is a $\{0, 1\}$-valued indicator function for the event $\{x \le t\}$. Since the population CDF can be written as $F(t) = \mathbb{E}[\mathbb{I}_{(-\infty, t]}[X]]$, the empirical CDF is an unbiased estimate. Figure 4.1 provides some illustrations of empirical CDFs for the uniform distribution on the interval $[0, 1]$ for two different sample sizes. Note that $\widehat{F}_n$ is a random function, with the value $\widehat{F}_n(t)$ corresponding to the fraction of samples that lie in the interval $(-\infty, t]$. As the sample size $n$ grows, we see that $\widehat{F}_n$ approaches $F$: compare the plot for $n = 10$ in Figure 4.1(a) to that for $n = 100$ in Figure 4.1(b). It is easy to see that $\widehat{F}_n$ converges to $F$ in a pointwise sense. Indeed, for any fixed $t \in \mathbb{R}$, the random variable $\widehat{F}_n(t)$ has mean $F(t)$, and moments of all orders, so that the strong law of large numbers implies that $\widehat{F}_n(t) \xrightarrow{\text{a.s.}} F(t)$. A natural goal is to strengthen this pointwise convergence to a form of uniform convergence.

Figure 4.1 Plots of population and empirical CDF functions for the uniform distribution on $[0, 1]$. (a) Empirical CDF based on $n = 10$ samples. (b) Empirical CDF based on $n = 100$ samples.
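The empirical CDF and its uniform deviation from the population CDF are easy to compute directly. The sketch below builds $\widehat{F}_n$ from a list of samples and, for the uniform distribution on $[0, 1]$ (where $F(t) = t$), evaluates $\sup_t |\widehat{F}_n(t) - F(t)|$ by checking the jump points of $\widehat{F}_n$. The function names are illustrative, and the closed-form search over jump points is specific to the uniform case assumed here.

```python
import bisect
import random


def empirical_cdf(samples):
    """Return the function t -> fraction of samples that are <= t."""
    xs = sorted(samples)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n


def sup_deviation_uniform(samples):
    """sup over t of |F_n(t) - t|, assuming samples from the uniform law on [0, 1]."""
    xs = sorted(samples)
    n = len(xs)
    # F_n jumps at each order statistic, so the supremum is attained
    # either at a jump point or in the limit just before it.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


rng = random.Random(0)
for n in (10, 100, 1000):
    draws = [rng.random() for _ in range(n)]
    print(n, sup_deviation_uniform(draws))
```

Running the loop for increasing $n$ gives a quick empirical view of the uniform convergence that this chapter sets out to quantify.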

Why are uniform convergence results interesting and important? In statistical settings, a typical use of the empirical CDF is to construct estimators of various quantities associated with the population CDF. Many such estimation problems can be formulated in terms of a functional $\gamma$ that maps any CDF $F$ to a real number $\gamma(F)$, that is, $F \mapsto \gamma(F)$. Given a set of samples distributed according to $F$, the plug-in principle suggests replacing the unknown $F$ with the empirical CDF $\widehat{F}_n$, thereby obtaining $\gamma(\widehat{F}_n)$ as an estimate of $\gamma(F)$. Let us illustrate this procedure via some examples.


Example 4.1 (Expectation functionals) Given some integrable function $g$, we may define
the expectation functional $\gamma_g$ via

$$\gamma_g(F) := \int g(x) \, dF(x). \tag{4.2}$$

For instance, for the function $g(x) = x$, the functional $\gamma_g$ maps $F$ to $\mathbb{E}[X]$, where $X$ is a random
variable with CDF $F$. For any $g$, the plug-in estimate is given by $\gamma_g(\widehat{F}_n) = \frac{1}{n} \sum_{i=1}^n g(X_i)$,
corresponding to the sample mean of $g(X)$. In the special case $g(x) = x$, we recover the
usual sample mean $\frac{1}{n} \sum_{i=1}^n X_i$ as an estimate for the mean $\mu = \mathbb{E}[X]$. A similar interpretation
applies to other choices of the underlying function $g$. ♣


Example 4.2 (Quantile functionals) For any $\alpha \in [0, 1]$, the quantile functional $Q_\alpha$ is given
by

$$Q_\alpha(F) := \inf\{t \in \mathbb{R} \mid F(t) \ge \alpha\}. \tag{4.3}$$

The median corresponds to the special case $\alpha = 0.5$. The plug-in estimate is given by

$$Q_\alpha(\widehat{F}_n) := \inf\Big\{t \in \mathbb{R} \;\Big|\; \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{(-\infty, t]}[X_i] \ge \alpha\Big\}, \tag{4.4}$$

and corresponds to estimating the $\alpha$th quantile of the distribution by the $\alpha$th sample quantile.
In the special case $\alpha = 0.5$, this estimate corresponds to the sample median. Again, it is of
interest to determine in what sense (if any) the random variable $Q_\alpha(\widehat{F}_n)$ approaches $Q_\alpha(F)$
as $n$ becomes large. In this case, $Q_\alpha(\widehat{F}_n)$ is a fairly complicated, nonlinear function of all the
variables, so that this convergence does not follow immediately by a classical result such as
the law of large numbers. ♣
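
The plug-in estimates in Examples 4.1 and 4.2 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`empirical_cdf`, `plug_in_mean`, `plug_in_quantile`) are hypothetical and not part of the text, and the quantile routine simply scans the order statistics for the smallest one at which the empirical CDF reaches the level alpha.

```python
import math

def empirical_cdf(samples, t):
    """Empirical CDF at t: the fraction of samples that are <= t."""
    return sum(1 for x in samples if x <= t) / len(samples)

def plug_in_mean(samples, g=lambda x: x):
    """Plug-in estimate of the expectation functional: sample mean of g(X)."""
    return sum(g(x) for x in samples) / len(samples)

def plug_in_quantile(samples, alpha):
    """Plug-in estimate inf{t : F_n(t) >= alpha} of the alpha-quantile."""
    if alpha <= 0:
        # F_n(t) >= 0 holds for every t, so the infimum is -infinity.
        return -math.inf
    xs = sorted(samples)
    n = len(xs)
    # F_n equals k/n at the k-th order statistic; return the first order
    # statistic at which the empirical CDF reaches level alpha.
    for k, x in enumerate(xs, start=1):
        if k / n >= alpha:
            return x
    return xs[-1]  # only reached if alpha > 1
```

With `alpha = 0.5` the last function returns a sample median (the lower one when the sample size is even), matching the infimum in the definition (4.4).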


Example 4.3 (Goodness-of-fit functionals) It is frequently of interest to test the hypothesis
of whether or not a given set of data has been drawn from a known distribution $F_0$. For
instance, we might be interested in assessing departures from uniformity, in which case
$F_0$ would be a uniform distribution on some interval, or departures from Gaussianity, in
which case $F_0$ would specify a Gaussian with a fixed mean and variance. Such tests can
be performed using functionals that measure the distance between $F$ and the target CDF $F_0$,
including the sup-norm distance $\|F - F_0\|_\infty$, or other distances such as the Cramér–von Mises
criterion based on the functional $\gamma(F) := \int_{-\infty}^{\infty} [F(x) - F_0(x)]^2 \, dF_0(x)$. ♣


For any plug-in estimator $\gamma(\widehat{F}_n)$, an important question is to understand when it is
consistent, that is, when does $\gamma(\widehat{F}_n)$ converge to $\gamma(F)$ in probability (or almost surely)? This
question can be addressed in a unified manner for many functionals by defining a notion of
continuity. Given a pair of CDFs $F$ and $G$, let us measure the distance between them using
the sup-norm

$$\|G - F\|_\infty := \sup_{t \in \mathbb{R}} |G(t) - F(t)|. \tag{4.5}$$

We can then define the continuity of a functional $\gamma$ with respect to this norm: more precisely,
we say that the functional $\gamma$ is continuous at $F$ in the sup-norm if, for all $\epsilon > 0$, there exists
a $\delta > 0$ such that $\|G - F\|_\infty \le \delta$ implies that $|\gamma(G) - \gamma(F)| \le \epsilon$.

As we explore in Exercise 4.1, this notion is useful, because for any continuous functional,
it reduces the consistency question for the plug-in estimator $\gamma(\widehat{F}_n)$ to the issue of
whether or not the random variable $\|\widehat{F}_n - F\|_\infty$ converges to zero. A classical result, known
as the Glivenko–Cantelli theorem, addresses the latter question:


Theorem 4.4 (Glivenko–Cantelli) For any distribution, the empirical CDF $\widehat{F}_n$ is a
strongly consistent estimator of the population CDF in the uniform norm, meaning that

$$\|\widehat{F}_n - F\|_\infty \xrightarrow{\text{a.s.}} 0. \tag{4.6}$$


We provide a proof of this claim as a corollary of a more general result to follow (see
Theorem 4.10). For statistical applications, an important consequence of Theorem 4.4 is
that the plug-in estimate $\gamma(\widehat{F}_n)$ is almost surely consistent as an estimator of $\gamma(F)$ for any
functional $\gamma$ that is continuous with respect to the sup-norm. See Exercise 4.1 for further
exploration of this connection.
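
As an informal numerical illustration of Theorem 4.4, the sketch below computes the sup-norm distance between the empirical CDF and the population CDF for samples from the uniform distribution on [0, 1], where F(t) = t. The helper name `sup_distance_uniform`, the seed and the sample sizes are arbitrary choices made for this illustration. Because the empirical CDF is a step function, the supremum only needs to be checked at the jump points and just before them.

```python
import random

def sup_distance_uniform(samples):
    """sup_t |F_n(t) - t| for samples from Uniform[0, 1]."""
    xs = sorted(samples)
    n = len(xs)
    # At the i-th order statistic (0-based) F_n jumps from i/n to (i+1)/n,
    # so the largest gap at that point is one of the two differences below.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000, 10000):
        d = sup_distance_uniform([rng.random() for _ in range(n)])
        print(n, round(d, 4))
```

On a typical run the printed distances shrink as n grows, in line with the almost-sure convergence in (4.6); a single simulation is of course only suggestive, not a proof.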


4.1.2 Uniform laws for more general function classes 


We now turn to more general consideration of uniform laws of large numbers. Let $\mathcal{F}$ be
a class of integrable real-valued functions with domain $\mathcal{X}$, and let $\{X_i\}_{i=1}^n$ be a collection of
i.i.d. samples from some distribution $\mathbb{P}$ over $\mathcal{X}$. Consider the random variable

$$\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)] \Big|, \tag{4.7}$$

which measures the absolute deviation between the sample average $\frac{1}{n} \sum_{i=1}^n f(X_i)$ and the
population average $\mathbb{E}[f(X)]$, uniformly over the class $\mathcal{F}$. Note that there can be measurability
concerns associated with the definition (4.7); see the bibliographic section for discussion of
different ways in which to resolve them.


Definition 4.5 We say that $\mathcal{F}$ is a Glivenko–Cantelli class for $\mathbb{P}$ if $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$
converges to zero in probability as $n \to \infty$.


This notion can also be defined in a stronger sense, requiring almost sure convergence
of $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$, in which case we say that $\mathcal{F}$ satisfies a strong Glivenko–Cantelli law. The
classical result on the empirical CDF (Theorem 4.4) can be reformulated as a particular case
of this notion:


Example 4.6 (Empirical CDFs and indicator functions) Consider the function class

$$\mathcal{F} = \{\mathbb{I}_{(-\infty, t]}(\cdot) \mid t \in \mathbb{R}\}, \tag{4.8}$$

where $\mathbb{I}_{(-\infty, t]}$ is the $\{0, 1\}$-valued indicator function of the interval $(-\infty, t]$. For each fixed
$t \in \mathbb{R}$, we have the equality $\mathbb{E}[\mathbb{I}_{(-\infty, t]}(X)] = \mathbb{P}[X \le t] = F(t)$, so that the classical
Glivenko–Cantelli theorem is equivalent to a strong uniform law for the class (4.8). ♣


Not all classes of functions are Glivenko—Cantelli, as illustrated by the following example. 


Example 4.7 (Failure of uniform law) Let $\mathcal{S}$ be the class of all subsets $S$ of $[0,1]$ such
that the subset $S$ has a finite number of elements, and consider the function class
$\mathcal{F}_{\mathcal{S}} = \{\mathbb{I}_S(\cdot) \mid S \in \mathcal{S}\}$ of ($\{0,1\}$-valued) indicator functions of such sets. Suppose that samples
$X_i$ are drawn from some distribution over $[0, 1]$ that has no atoms (i.e., $\mathbb{P}(\{x\}) = 0$ for all
$x \in [0,1]$); this class includes any distribution that has a density with respect to Lebesgue
measure. For any such distribution, we are guaranteed that $\mathbb{P}[S] = 0$ for all $S \in \mathcal{S}$. On the
other hand, for any positive integer $n \in \mathbb{N}$, the discrete set $\{X_1, \ldots, X_n\}$ belongs to $\mathcal{S}$, and
moreover, by definition of the empirical distribution, we have $\mathbb{P}_n[X_1^n] = 1$. Putting together
the pieces, we conclude that

$$\sup_{S \in \mathcal{S}} |\mathbb{P}_n[S] - \mathbb{P}[S]| = 1 - 0 = 1, \tag{4.9}$$

so that the function class $\mathcal{F}_{\mathcal{S}}$ is not a Glivenko–Cantelli class for $\mathbb{P}$. ♣
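
The mechanism behind Example 4.7 can be made concrete with a toy computation. In this sketch (with the hypothetical helper `empirical_measure`, and the uniform distribution standing in for an arbitrary atomless distribution), the finite set consisting of the observed sample points always receives empirical mass one, whereas its population mass is zero.

```python
import random

def empirical_measure(samples, S):
    """P_n[S]: the fraction of samples that fall in the set S."""
    return sum(1 for x in samples if x in S) / len(samples)

rng = random.Random(0)
n = 50
samples = [rng.random() for _ in range(n)]

# The set of observed points is a finite subset of [0, 1], hence a member
# of the class of finite sets considered in the example.
S = set(samples)
p_n = empirical_measure(samples, S)  # equals 1 by construction

# Assumption: the sampling distribution is atomless, so P[S] = 0 for any
# finite set S; this value is imposed here, not computed.
p = 0.0
gap = abs(p_n - p)
```

The gap equals one regardless of the sample size, which is exactly the obstruction expressed by equation (4.9).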


We have seen that the classical Glivenko–Cantelli law, which guarantees convergence
of a special case of the variable $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$, is of interest in analyzing estimators based on
“plugging in” the empirical CDF. It is natural to ask in what other statistical contexts
these quantities arise. In fact, variables of the form $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ are ubiquitous throughout
statistics; in particular, they lie at the heart of methods based on empirical risk minimization.
In order to describe this notion more concretely, let us consider an indexed family of
probability distributions $\{\mathbb{P}_\theta \mid \theta \in \Omega\}$, and suppose that we are given $n$ samples $\{X_i\}_{i=1}^n$, each
sample lying in some space $\mathcal{X}$. Suppose that the samples are drawn i.i.d. according to a
distribution $\mathbb{P}_{\theta^*}$, for some fixed but unknown $\theta^* \in \Omega$. Here the index $\theta^*$ could lie within a
finite-dimensional space, such as $\Omega = \mathbb{R}^d$ in a vector estimation problem, or could lie within
some function class $\Omega = \mathcal{G}$, in which case the problem is of the nonparametric variety.

In either case, a standard decision-theoretic approach to estimating $\theta^*$ is based on minimizing
a cost function of the form $\theta \mapsto \mathcal{L}_\theta(X)$, which measures the “fit” between a parameter
$\theta \in \Omega$ and the sample $X \in \mathcal{X}$. Given the collection of $n$ samples $\{X_i\}_{i=1}^n$, the principle of
empirical risk minimization is based on the objective function

$$\widehat{R}_n(\theta, \theta^*) := \frac{1}{n} \sum_{i=1}^n \mathcal{L}_\theta(X_i).$$

This quantity is known as the empirical risk, since it is defined by the samples $X_1^n$, and our
notation reflects the fact that these samples depend, in turn, on the unknown distribution
$\mathbb{P}_{\theta^*}$. This empirical risk should be contrasted with the population risk,

$$R(\theta, \theta^*) := \mathbb{E}_{\theta^*}[\mathcal{L}_\theta(X)],$$

where the expectation $\mathbb{E}_{\theta^*}$ is taken over a sample $X \sim \mathbb{P}_{\theta^*}$.

In practice, one minimizes the empirical risk over some subset $\Omega_0$ of the full space $\Omega$,
thereby obtaining some estimate $\widehat{\theta}$. The statistical question is how to bound the excess risk,
measured in terms of the population quantities, namely the difference

$$E(\widehat{\theta}, \theta^*) := R(\widehat{\theta}, \theta^*) - \inf_{\theta \in \Omega_0} R(\theta, \theta^*).$$

Let us consider some examples to illustrate.


Example 4.8 (Maximum likelihood) Consider a parameterized family of distributions,
say $\{\mathbb{P}_\theta, \theta \in \Omega\}$, each with a strictly positive density $p_\theta$ defined with respect to a common
underlying measure. Now suppose that we are given $n$ i.i.d. samples from an unknown
distribution $\mathbb{P}_{\theta^*}$, and we would like to estimate the unknown parameter $\theta^*$. In order to do so, we
consider the cost function

$$\mathcal{L}_\theta(x) := \log\Big[\frac{p_{\theta^*}(x)}{p_\theta(x)}\Big].$$

The term $p_{\theta^*}(x)$, which we have included for later theoretical convenience, has no effect on
the minimization over $\theta$. Indeed, the maximum likelihood estimate is obtained by minimizing
the empirical risk defined by this cost function, that is

$$\widehat{\theta} \in \arg\min_{\theta \in \Omega_0} \underbrace{\Big\{\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(X_i)}{p_\theta(X_i)}\Big\}}_{\widehat{R}_n(\theta, \theta^*)} = \arg\min_{\theta \in \Omega_0} \Big\{\frac{1}{n} \sum_{i=1}^n \log \frac{1}{p_\theta(X_i)}\Big\}.$$

The population risk is given by $R(\theta, \theta^*) = \mathbb{E}_{\theta^*}\big[\log \frac{p_{\theta^*}(X)}{p_\theta(X)}\big]$, a quantity known as the
Kullback–Leibler divergence between $p_{\theta^*}$ and $p_\theta$. In the special case that $\theta^* \in \Omega_0$, the excess risk is
simply the Kullback–Leibler divergence between the true density $p_{\theta^*}$ and the fitted model
$p_{\widehat{\theta}}$. See Exercise 4.3 for some concrete examples. ♣


Example 4.9 (Binary classification) Suppose that we observe $n$ pairs of samples, each of
the form $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, where the vector $X_i$ corresponds to a set of $d$ predictors or
features, and the binary variable $Y_i$ corresponds to a label. We can view such data as being
generated by some distribution $\mathbb{P}_X$ over the features, and a conditional distribution $\mathbb{P}_{Y \mid X}$.
Since $Y$ takes binary values, the conditional distribution is fully specified by the likelihood
ratio $\psi(x) = \frac{\mathbb{P}[Y = +1 \mid X = x]}{\mathbb{P}[Y = -1 \mid X = x]}$.

The goal of binary classification is to estimate a function $f \colon \mathbb{R}^d \to \{-1, +1\}$ that minimizes
the probability of misclassification $\mathbb{P}[f(X) \ne Y]$, for an independently drawn pair
$(X, Y)$. Note that this probability of error corresponds to the population risk for the cost
function

$$\mathcal{L}_f(X, Y) := \begin{cases} 1 & \text{if } f(X) \ne Y, \\ 0 & \text{otherwise.} \end{cases} \tag{4.10}$$

A function that minimizes this probability of error is known as a Bayes classifier $f^*$; in
the special case of equally probable classes, that is, when $\mathbb{P}[Y = +1] = \mathbb{P}[Y = -1] = \frac{1}{2}$, a
Bayes classifier is given by

$$f^*(x) = \begin{cases} +1 & \text{if } \psi(x) \ge 1, \\ -1 & \text{otherwise.} \end{cases}$$

Since the likelihood ratio $\psi$ (and hence $f^*$) is unknown, a natural approach to approximating
the Bayes rule is based on choosing $\widehat{f}$ to minimize the empirical risk

$$\widehat{R}_n(f) := \frac{1}{n} \sum_{i=1}^n \underbrace{\mathbb{I}[f(X_i) \ne Y_i]}_{\mathcal{L}_f(X_i, Y_i)},$$

corresponding to the fraction of training samples that are misclassified. Typically, the
minimization over $f$ is restricted to some subset of all possible decision rules. See Chapter 14
for some further discussion of how to analyze such methods for binary classification. ♣


Returning to the main thread, our goal is to develop methods for controlling the excess
risk. For simplicity, let us assume¹ that there exists some $\theta_0 \in \Omega_0$ such that
$R(\theta_0, \theta^*) = \inf_{\theta \in \Omega_0} R(\theta, \theta^*)$. With this notation, the excess risk can be decomposed as

$$E(\widehat{\theta}, \theta^*) = \underbrace{\{R(\widehat{\theta}, \theta^*) - \widehat{R}_n(\widehat{\theta}, \theta^*)\}}_{T_1} + \underbrace{\{\widehat{R}_n(\widehat{\theta}, \theta^*) - \widehat{R}_n(\theta_0, \theta^*)\}}_{T_2 \le 0} + \underbrace{\{\widehat{R}_n(\theta_0, \theta^*) - R(\theta_0, \theta^*)\}}_{T_3}.$$

Note that $T_2$ is non-positive, since $\widehat{\theta}$ minimizes the empirical risk over $\Omega_0$.

¹ If the infimum is not achieved, then we choose an element $\theta_0$ for which this equality holds up to some
arbitrarily small tolerance $\epsilon > 0$, and the analysis to follow holds up to this tolerance.

The third term $T_3$ can be dealt with in a relatively straightforward manner, because $\theta_0$ is
an unknown but non-random quantity. Indeed, recalling the definition of the empirical risk,
we have

$$T_3 = \Big[\frac{1}{n} \sum_{i=1}^n \mathcal{L}_{\theta_0}(X_i)\Big] - \mathbb{E}_X[\mathcal{L}_{\theta_0}(X)],$$

corresponding to the deviation of a sample mean from its expectation for the random variable
$\mathcal{L}_{\theta_0}(X)$. This quantity can be controlled using the techniques introduced in Chapter 2, for
instance, via the Hoeffding bound when the samples are independent and the cost function
is bounded.

Finally, returning to the first term, it can be written in a similar way, namely as the
difference

$$T_1 = \mathbb{E}_X[\mathcal{L}_{\widehat{\theta}}(X)] - \Big[\frac{1}{n} \sum_{i=1}^n \mathcal{L}_{\widehat{\theta}}(X_i)\Big].$$

This quantity is more challenging to control, because the parameter $\widehat{\theta}$, in contrast to the
deterministic quantity $\theta_0$, is now random, and moreover depends on the samples $\{X_i\}_{i=1}^n$,
since it was obtained by minimizing the empirical risk. For this reason, controlling the first
term requires a stronger result, such as a uniform law of large numbers over the cost function
class $\mathfrak{L}(\Omega_0) := \{x \mapsto \mathcal{L}_\theta(x), \; \theta \in \Omega_0\}$. With this notation, we have

$$T_1 \le \sup_{\theta \in \Omega_0} \Big| \frac{1}{n} \sum_{i=1}^n \mathcal{L}_\theta(X_i) - \mathbb{E}_X[\mathcal{L}_\theta(X)] \Big| = \|\mathbb{P}_n - \mathbb{P}\|_{\mathfrak{L}(\Omega_0)}.$$

Since $T_3$ is also dominated by this same quantity, we conclude that the excess risk is at
most $2 \|\mathbb{P}_n - \mathbb{P}\|_{\mathfrak{L}(\Omega_0)}$. This derivation demonstrates that the central challenge in analyzing
estimators based on empirical risk minimization is to establish a uniform law of large numbers
for the loss class $\mathfrak{L}(\Omega_0)$. We explore various concrete examples of this procedure in the
exercises.
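
The bound of the excess risk by twice the uniform deviation can be checked numerically on a toy classification problem. The setup below is an assumption made purely for illustration and does not come from the text: features are uniform on [0, 1], the noiseless label is +1 when x > 0.5, each label is flipped with probability `NOISE`, and the decision rules are thresholds taken from the finite grid `GRID`. For this model the population risk of a threshold rule is available in closed form, so both sides of the inequality can be computed.

```python
import random

NOISE = 0.1                            # label-flip probability (assumed)
GRID = [k / 100 for k in range(101)]   # candidate thresholds; contains 0.5

def sample(n, rng):
    """Draw n pairs (x, y) from the toy model described above."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else -1
        if rng.random() < NOISE:
            y = -y
        data.append((x, y))
    return data

def empirical_risk(t, data):
    """Fraction of pairs misclassified by the rule x -> sign(x - t)."""
    return sum(1 for x, y in data if (1 if x > t else -1) != y) / len(data)

def population_risk(t):
    """Misclassification probability of the threshold rule t.

    The rule disagrees with the noiseless label on an interval of length
    |t - 0.5|, where it errs with probability 1 - NOISE; elsewhere it errs
    with probability NOISE.
    """
    return NOISE + (1 - 2 * NOISE) * abs(t - 0.5)

def erm_report(n, seed=0):
    """Return (ERM threshold, excess risk, uniform deviation over GRID)."""
    data = sample(n, random.Random(seed))
    emp = {t: empirical_risk(t, data) for t in GRID}
    t_hat = min(GRID, key=lambda t: emp[t])
    excess = population_risk(t_hat) - min(population_risk(t) for t in GRID)
    sup_dev = max(abs(emp[t] - population_risk(t)) for t in GRID)
    return t_hat, excess, sup_dev
```

For any sample, the returned excess risk is at most twice the returned uniform deviation, which mirrors the decomposition into the terms T1, T2 and T3; the inequality holds deterministically here because the same finite grid is used for the minimization and for the supremum.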


4.2 A uniform law via Rademacher complexity 


Having developed various motivations for studying uniform laws, let us now turn to the
technical details of deriving such results. An important quantity that underlies the study of
uniform laws is the Rademacher complexity of the function class $\mathcal{F}$. For any fixed collection
$x_1^n := (x_1, \ldots, x_n)$ of points, consider the subset of $\mathbb{R}^n$ given by

$$\mathcal{F}(x_1^n) := \{(f(x_1), \ldots, f(x_n)) \mid f \in \mathcal{F}\}. \tag{4.11}$$

The set $\mathcal{F}(x_1^n)$ corresponds to all those vectors in $\mathbb{R}^n$ that can be realized by applying a
function $f \in \mathcal{F}$ to the collection $(x_1, \ldots, x_n)$, and the empirical Rademacher complexity is
given by

$$\mathcal{R}(\mathcal{F}(x_1^n)/n) := \mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big|\Big]. \tag{4.12}$$

Note that this definition coincides with our earlier definition of the Rademacher complexity
of a set (see Example 2.25).
Given a collection $X_1^n := \{X_i\}_{i=1}^n$ of random samples, the empirical Rademacher
complexity $\mathcal{R}(\mathcal{F}(X_1^n)/n)$ is a random variable. Taking its expectation yields the Rademacher
complexity of the function class $\mathcal{F}$, namely, the deterministic quantity

$$\mathcal{R}_n(\mathcal{F}) := \mathbb{E}_X[\mathcal{R}(\mathcal{F}(X_1^n)/n)] = \mathbb{E}_{X, \varepsilon}\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|\Big]. \tag{4.13}$$

Note that the Rademacher complexity is the average of the maximum correlation between
the vector $(f(X_1), \ldots, f(X_n))$ and the “noise vector” $(\varepsilon_1, \ldots, \varepsilon_n)$, where the maximum is
taken over all functions $f \in \mathcal{F}$. The intuition is a natural one: a function class is extremely
large (and, in fact, “too large” for statistical purposes) if we can always find a function
that has a high correlation with a randomly drawn noise vector. Conversely, when the
Rademacher complexity decays as a function of sample size, then it is impossible to find a
function that correlates very highly in expectation with a randomly drawn noise vector.
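
To make the definition concrete, the following sketch estimates the empirical Rademacher complexity by Monte Carlo for the class of indicator functions of half-intervals (−∞, t]. The function names, the number of trials and the seed are arbitrary choices for this illustration. For this particular class, the supremum over t reduces to the largest absolute partial sum of the signs taken in the order of the sorted points, since each function in the class selects a prefix of the sorted sample.

```python
import random

def sup_correlation(xs, eps):
    """sup over t of |(1/n) * sum_i eps_i * 1[x_i <= t]| for fixed xs, eps."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    partial, best = 0, 0  # best = 0 accounts for t below all points
    for pos, i in enumerate(order):
        partial += eps[i]
        # A threshold cannot separate tied points, so the prefix sum is only
        # evaluated once the whole group of equal values has been added.
        if pos + 1 == n or xs[order[pos + 1]] != xs[i]:
            best = max(best, abs(partial))
    return best / n

def empirical_rademacher(xs, trials=2000, seed=0):
    """Monte Carlo average of sup_correlation over random sign vectors."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += sup_correlation(xs, eps)
    return total / trials
```

Calling `empirical_rademacher` on samples of increasing size gives values that decay with n, consistent with the intuition that this class is too small to correlate with pure noise.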


We now make precise the connection between Rademacher complexity and the
Glivenko–Cantelli property, in particular by showing that, for any bounded function class $\mathcal{F}$, the
condition $\mathcal{R}_n(\mathcal{F}) = o(1)$ implies the Glivenko–Cantelli property. More precisely, we prove a
non-asymptotic statement, in terms of a tail bound for the probability that the random variable
$\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ deviates substantially above a multiple of the Rademacher complexity. It
applies to a function class $\mathcal{F}$ that is $b$-uniformly bounded, meaning that $\|f\|_\infty \le b$ for all
$f \in \mathcal{F}$.


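Theorem 4.10 For any $b$-uniformly bounded class of functions $\mathcal{F}$, any positive integer
$n \ge 1$ and any scalar $\delta \ge 0$, we have

$$\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} \le 2 \mathcal{R}_n(\mathcal{F}) + \delta \tag{4.14}$$

with $\mathbb{P}$-probability at least $1 - \exp\big(-\frac{n \delta^2}{2 b^2}\big)$. Consequently, as long as $\mathcal{R}_n(\mathcal{F}) = o(1)$, we
have $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} \xrightarrow{\text{a.s.}} 0$.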


In order for Theorem 4.10 to be useful, we need to obtain upper bounds on the Rade- 
macher complexity. There are a variety of methods for doing so, ranging from direct cal- 
culations to alternative complexity measures. In Section 4.3, we develop some techniques 
for upper bounding the Rademacher complexity for indicator functions of half-intervals, as 
required for the classical Glivenko—Cantelli theorem (see Example 4.6); we also discuss the 
notion of Vapnik–Chervonenkis dimension, which can be used to upper bound the
Rademacher complexity for other function classes. In Chapter 5, we introduce more advanced
techniques based on metric entropy and chaining for controlling Rademacher complexity
and related sub-Gaussian processes. In the meantime, let us turn to the proof of
Theorem 4.10.


Proof We first note that if $\mathcal{R}_n(\mathcal{F}) = o(1)$, then the almost-sure convergence follows from
the tail bound (4.14) and the Borel–Cantelli lemma. Accordingly, the remainder of the
argument is devoted to proving the tail bound (4.14).


Concentration around mean: We first claim that, when $\mathcal{F}$ is uniformly bounded, then the
random variable $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ is sharply concentrated around its mean. In order to simplify
notation, it is convenient to define the recentered functions $\bar{f}(x) := f(x) - \mathbb{E}[f(X)]$, and to
write $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \big| \frac{1}{n} \sum_{i=1}^n \bar{f}(X_i) \big|$. Thinking of the samples as fixed for the moment,
consider the function

$$G(x_1, \ldots, x_n) := \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \bar{f}(x_i) \Big|.$$

We claim that $G$ satisfies the Lipschitz property required to apply the bounded differences
method (recall Corollary 2.21). Since the function $G$ is invariant to permutation of its
coordinates, it suffices to bound the difference when the first coordinate $x_1$ is perturbed.
Accordingly, we define the vector $y \in \mathbb{R}^n$ with $y_i = x_i$ for all $i \ne 1$, and seek to bound the difference
$|G(x) - G(y)|$. For any function $\bar{f} = f - \mathbb{E}[f]$, we have

$$\Big| \frac{1}{n} \sum_{i=1}^n \bar{f}(x_i) \Big| - \sup_{h \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \bar{h}(y_i) \Big| \le \Big| \frac{1}{n} \sum_{i=1}^n \bar{f}(x_i) \Big| - \Big| \frac{1}{n} \sum_{i=1}^n \bar{f}(y_i) \Big| \le \frac{1}{n} \big| \bar{f}(x_1) - \bar{f}(y_1) \big| \le \frac{2b}{n}, \tag{4.15}$$

where the final inequality uses the fact that

$$|\bar{f}(x_1) - \bar{f}(y_1)| = |f(x_1) - f(y_1)| \le 2b,$$

which follows from the uniform boundedness condition $\|f\|_\infty \le b$. Since the inequality (4.15)
holds for any function $f$, we may take the supremum over $f \in \mathcal{F}$ on both sides; doing so
yields the inequality $G(x) - G(y) \le \frac{2b}{n}$. Since the same argument may be applied with the
roles of $x$ and $y$ reversed, we conclude that $|G(x) - G(y)| \le \frac{2b}{n}$. Therefore, by the bounded
differences method (see Corollary 2.21), we have

$$\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} - \mathbb{E}[\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}] \le t \quad \text{with $\mathbb{P}$-prob. at least } 1 - \exp\Big(-\frac{n t^2}{2 b^2}\Big), \tag{4.16}$$

valid for all $t \ge 0$.

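Upper bound on mean: It remains to show that $\mathbb{E}[\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}]$ is upper bounded by $2 \mathcal{R}_n(\mathcal{F})$,
and we do so using a classical symmetrization argument. Letting $(Y_1, \ldots, Y_n)$ be a second
i.i.d. sequence, independent of $(X_1, \ldots, X_n)$, we have

$$\begin{aligned}
\mathbb{E}_X[\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}] &= \mathbb{E}_X\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - \mathbb{E}_{Y_i}[f(Y_i)]\} \Big|\Big] \\
&= \mathbb{E}_X\Big[\sup_{f \in \mathcal{F}} \Big| \mathbb{E}_Y\Big[\frac{1}{n} \sum_{i=1}^n \{f(X_i) - f(Y_i)\}\Big] \Big|\Big] \\
&\overset{(i)}{\le} \mathbb{E}_{X,Y}\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - f(Y_i)\} \Big|\Big],
\end{aligned} \tag{4.17}$$

where the upper bound (i) follows from the calculation of Exercise 4.4.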

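Now let $(\varepsilon_1, \ldots, \varepsilon_n)$ be an i.i.d. sequence of Rademacher variables, independent of $X$
and $Y$. Given our independence assumptions, for any function $f \in \mathcal{F}$, the random vector
with components $\varepsilon_i (f(X_i) - f(Y_i))$ has the same joint distribution as the random vector with
components $f(X_i) - f(Y_i)$, whence

$$\begin{aligned}
\mathbb{E}_{X,Y}\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - f(Y_i)\} \Big|\Big] &= \mathbb{E}_{X,Y,\varepsilon}\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f(X_i) - f(Y_i)) \Big|\Big] \\
&\le 2 \, \mathbb{E}_{X,\varepsilon}\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|\Big] = 2 \mathcal{R}_n(\mathcal{F}).
\end{aligned} \tag{4.18}$$

Combining the upper bound (4.18) with the tail bound (4.16) yields the claim.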
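
The symmetrization bound on the mean can be sanity-checked by simulation. The sketch below is an illustration under stated assumptions, not part of the proof: it uses the class of half-interval indicators with samples from the uniform distribution on [0, 1], and the helper names, sample size, trial count and seed are arbitrary. It averages the uniform deviation sup_t |F_n(t) − t| and the symmetrized supremum over independent trials, so that the first average can be compared with twice the second.

```python
import random

def sup_deviation(xs):
    """sup_t |F_n(t) - t| for samples xs from Uniform[0, 1]."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def sup_rademacher(xs, eps):
    """sup_t |(1/n) sum_i eps_i 1[x_i <= t]|, assuming distinct points."""
    partial, best = 0, 0
    for _, e in sorted(zip(xs, eps)):
        partial += e
        best = max(best, abs(partial))
    return best / len(xs)

def compare(n=200, trials=500, seed=0):
    """Monte Carlo estimates of E||P_n - P|| and of the Rademacher complexity."""
    rng = random.Random(seed)
    dev = rad = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        dev += sup_deviation(xs)
        rad += sup_rademacher(xs, eps)
    return dev / trials, rad / trials
```

In such a simulation the estimated mean deviation should fall below twice the estimated Rademacher complexity, as the symmetrization argument predicts; the estimates carry Monte Carlo error, so the comparison is indicative only.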


4.2.1 Necessary conditions with Rademacher complexity 


The proof of Theorem 4.10 illustrates an important technique known as symmetrization,
which relates the random variable $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ to its symmetrized version

$$\|S_n\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|. \tag{4.19}$$

Note that the expectation of $\|S_n\|_{\mathcal{F}}$ corresponds to the Rademacher complexity, which plays
a central role in Theorem 4.10. It is natural to wonder whether much was lost in moving
from the variable $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ to its symmetrized version. The following “sandwich” result
relates these quantities.


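Proposition 4.11 For any convex non-decreasing function $\Phi \colon \mathbb{R} \to \mathbb{R}$, we have

$$\mathbb{E}_{X,\varepsilon}\big[\Phi\big(\tfrac{1}{2} \|S_n\|_{\bar{\mathcal{F}}}\big)\big] \overset{(a)}{\le} \mathbb{E}_X\big[\Phi(\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}})\big] \overset{(b)}{\le} \mathbb{E}_{X,\varepsilon}\big[\Phi(2 \|S_n\|_{\mathcal{F}})\big], \tag{4.20}$$

where $\bar{\mathcal{F}} = \{f - \mathbb{E}[f], \; f \in \mathcal{F}\}$ is the recentered function class.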


When applied with the convex non-decreasing function $\Phi(t) = t$, Proposition 4.11 yields the
inequalities

$$\tfrac{1}{2} \mathbb{E}_{X,\varepsilon} \|S_n\|_{\bar{\mathcal{F}}} \le \mathbb{E}_X[\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}] \le 2 \, \mathbb{E}_{X,\varepsilon} \|S_n\|_{\mathcal{F}}, \tag{4.21}$$

with the only differences being the constant pre-factors, and the use of $\mathcal{F}$ in the upper bound,
and the recentered class $\bar{\mathcal{F}}$ in the lower bound.

Other choices of interest include $\Phi(t) = e^{\lambda t}$ for some $\lambda > 0$, which can be used to control
the moment generating function.


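Proof Beginning with bound (b), we have

$$\begin{aligned}
\mathbb{E}_X[\Phi(\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}})] &= \mathbb{E}_X\Big[\Phi\Big(\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - \mathbb{E}_Y[f(Y_i)]\} \Big|\Big)\Big] \\
&\overset{(i)}{\le} \mathbb{E}_{X,Y}\Big[\Phi\Big(\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - f(Y_i)\} \Big|\Big)\Big] \\
&\overset{(ii)}{=} \underbrace{\mathbb{E}_{X,Y,\varepsilon}\Big[\Phi\Big(\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \{f(X_i) - f(Y_i)\} \Big|\Big)\Big]}_{:= T_1},
\end{aligned}$$

where inequality (i) follows from Exercise 4.4, using the convexity and non-decreasing
properties of $\Phi$, and equality (ii) follows since the random vector with components
$\varepsilon_i (f(X_i) - f(Y_i))$ has the same joint distribution as the random vector with components $f(X_i) - f(Y_i)$.
By the triangle inequality, we have

$$\begin{aligned}
T_1 &\le \mathbb{E}_{X,Y,\varepsilon}\Big[\Phi\Big(\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big| + \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(Y_i) \Big|\Big)\Big] \\
&\overset{(iii)}{\le} \frac{1}{2} \mathbb{E}_{X,\varepsilon}\Big[\Phi\Big(2 \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|\Big)\Big] + \frac{1}{2} \mathbb{E}_{Y,\varepsilon}\Big[\Phi\Big(2 \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(Y_i) \Big|\Big)\Big] \\
&\overset{(iv)}{=} \mathbb{E}_{X,\varepsilon}\Big[\Phi\Big(2 \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|\Big)\Big],
\end{aligned}$$

where step (iii) follows from Jensen’s inequality and the convexity of $\Phi$, and step (iv) follows
since $X$ and $Y$ are i.i.d. samples.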
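Turning to the bound (a), we have

$$\begin{aligned}
\mathbb{E}_{X,\varepsilon}\big[\Phi\big(\tfrac{1}{2} \|S_n\|_{\bar{\mathcal{F}}}\big)\big] &= \mathbb{E}_{X,\varepsilon}\Big[\Phi\Big(\frac{1}{2} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \{f(X_i) - \mathbb{E}_{Y_i}[f(Y_i)]\} \Big|\Big)\Big] \\
&\overset{(i)}{\le} \mathbb{E}_{X,Y,\varepsilon}\Big[\Phi\Big(\frac{1}{2} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \{f(X_i) - f(Y_i)\} \Big|\Big)\Big] \\
&\overset{(ii)}{=} \mathbb{E}_{X,Y}\Big[\Phi\Big(\frac{1}{2} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - f(Y_i)\} \Big|\Big)\Big],
\end{aligned}$$

where inequality (i) follows from Jensen’s inequality and the convexity of $\Phi$; and equality
(ii) follows since for each $i = 1, 2, \ldots, n$ and $f \in \mathcal{F}$, the variables $\varepsilon_i \{f(X_i) - f(Y_i)\}$ and
$f(X_i) - f(Y_i)$ have the same distribution.

Now focusing on the quantity $T_2 := \frac{1}{2} \sup_{f \in \mathcal{F}} \big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - f(Y_i)\} \big|$, we add and subtract
a term of the form $\mathbb{E}[f]$, and then apply the triangle inequality, thereby obtaining the upper
bound

$$T_2 \le \frac{1}{2} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - \mathbb{E}[f]\} \Big| + \frac{1}{2} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(Y_i) - \mathbb{E}[f]\} \Big|.$$

Since $\Phi$ is convex and non-decreasing, we are guaranteed that

$$\Phi(T_2) \le \frac{1}{2} \Phi\Big(\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(X_i) - \mathbb{E}[f]\} \Big|\Big) + \frac{1}{2} \Phi\Big(\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \{f(Y_i) - \mathbb{E}[f]\} \Big|\Big).$$

The claim follows by taking expectations and using the fact that $X$ and $Y$ are identically
distributed.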

A consequence of Proposition 4.11 is that the random variable $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ can be lower
bounded by a multiple of Rademacher complexity, and some fluctuation terms. This fact can
be used to prove the following:


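Proposition 4.12 For any $b$-uniformly bounded function class $\mathcal{F}$, any integer $n \ge 1$
and any scalar $\delta \ge 0$, we have

$$\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} \ge \frac{1}{2} \mathcal{R}_n(\mathcal{F}) - \frac{\sup_{f \in \mathcal{F}} |\mathbb{E}[f]|}{2 \sqrt{n}} - \delta \tag{4.22}$$

with $\mathbb{P}$-probability at least $1 - e^{-\frac{n \delta^2}{2 b^2}}$.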


We leave the proof of this result for the reader (see Exercise 4.5). As a consequence, if the
Rademacher complexity $\mathcal{R}_n(\mathcal{F})$ remains bounded away from zero, then $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}}$ cannot
converge to zero in probability. We have thus shown that, for a uniformly bounded function
class $\mathcal{F}$, the Rademacher complexity provides a necessary and sufficient condition for it to
be Glivenko–Cantelli.


4.3 Upper bounds on the Rademacher complexity 


Obtaining concrete results using Theorem 4.10 requires methods for upper bounding the 
Rademacher complexity. There are a variety of such methods, ranging from simple union 
bound methods (suitable for finite function classes) to more advanced techniques involv- 
ing the notion of metric entropy and chaining arguments. We explore the latter techniques 
in Chapter 5 to follow. This section is devoted to more elementary techniques, including 
those required to prove the classical Glivenko—Cantelli result, and, more generally, those 
that apply to function classes with polynomial discrimination, as well as associated Vapnik— 
Chervonenkis classes. 


4.3.1 Classes with polynomial discrimination 


For a given collection of points $x_1^n = (x_1, \ldots, x_n)$, the “size” of the set $\mathcal{F}(x_1^n)$ provides a
sample-dependent measure of the complexity of $\mathcal{F}$. In the simplest case, the set $\mathcal{F}(x_1^n)$
contains only a finite number of vectors for all sample sizes, so that its “size” can be measured
via its cardinality. For instance, if $\mathcal{F}$ consists of a family of decision rules taking binary
values (as in Example 4.9), then $\mathcal{F}(x_1^n)$ can contain at most $2^n$ elements. Of interest to us
are function classes for which this cardinality grows only as a polynomial function of $n$, as
formalized in the following:


Definition 4.13 (Polynomial discrimination) A class $\mathcal{F}$ of functions with domain $\mathcal{X}$
has polynomial discrimination of order $\nu \ge 1$ if, for each positive integer $n$ and
collection $x_1^n = \{x_1, \ldots, x_n\}$ of $n$ points in $\mathcal{X}$, the set $\mathcal{F}(x_1^n)$ has cardinality upper bounded
as

$$\operatorname{card}(\mathcal{F}(x_1^n)) \le (n + 1)^\nu. \tag{4.23}$$


The significance of this property is that it provides a straightforward approach to controlling
the Rademacher complexity. For any set $S \subset \mathbb{R}^n$, we use $D(S) := \sup_{x \in S} \|x\|_2$ to denote its
maximal width in the $\ell_2$-norm.


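Lemma 4.14 Suppose that $\mathcal{F}$ has polynomial discrimination of order $\nu$. Then for all
positive integers $n$ and any collection of points $x_1^n = (x_1, \ldots, x_n)$,

$$\underbrace{\mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big|\Big]}_{\mathcal{R}(\mathcal{F}(x_1^n)/n)} \le 4 D(x_1^n) \sqrt{\frac{\nu \log(n + 1)}{n}},$$

where $D(x_1^n) := \sup_{f \in \mathcal{F}} \sqrt{\frac{\sum_{i=1}^n f^2(x_i)}{n}}$ is the $\ell_2$-radius of the set $\mathcal{F}(x_1^n)/\sqrt{n}$.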


We leave the proof of this claim for the reader (see Exercise 4.9). 


Although Lemma 4.14 is stated as an upper bound on the empirical Rademacher
complexity, it yields as a corollary an upper bound on the Rademacher complexity
$\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_X[\mathcal{R}(\mathcal{F}(X_1^n)/n)]$, one which involves the expected $\ell_2$-width $\mathbb{E}_{X_1^n}[D(X_1^n)]$. An especially
simple case is when the function class is $b$-uniformly bounded, so that $D(x_1^n) \le b$ for all samples.
In this case, Lemma 4.14 implies that

$$\mathcal{R}_n(\mathcal{F}) \le 4 b \sqrt{\frac{\nu \log(n + 1)}{n}} \quad \text{for all } n \ge 1. \tag{4.24}$$


Combined with Theorem 4.10, we conclude that any bounded function class with poly- 
nomial discrimination is Glivenko—Cantelli. 


What types of function classes have polynomial discrimination? As discussed previously
in Example 4.6, the classical Glivenko–Cantelli law is based on indicator functions of the
left-sided intervals $(-\infty, t]$. These functions are uniformly bounded with $b = 1$, and moreover,
as shown in the following proof, this function class has polynomial discrimination of
order $\nu = 1$. Consequently, Theorem 4.10 combined with Lemma 4.14 yields a quantitative
version of Theorem 4.4 as a corollary.


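Corollary 4.15 (Classical Glivenko–Cantelli) Let $F(t) = \mathbb{P}[X \le t]$ be the CDF of a
random variable $X$, and let $\widehat{F}_n$ be the empirical CDF based on $n$ i.i.d. samples $X_i \sim \mathbb{P}$.
Then

$$\mathbb{P}\Big[\|\widehat{F}_n - F\|_\infty \ge 8 \sqrt{\frac{\log(n + 1)}{n}} + \delta\Big] \le e^{-\frac{n \delta^2}{2}} \quad \text{for all } \delta \ge 0, \tag{4.25}$$

and hence $\|\widehat{F}_n - F\|_\infty \xrightarrow{\text{a.s.}} 0$.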


Proof For a given sample $x_1^n = (x_1, \ldots, x_n) \in \mathbb{R}^n$, consider the set $\mathcal{F}(x_1^n)$, where $\mathcal{F}$ is
the set of all $\{0,1\}$-valued indicator functions of the half-intervals $(-\infty, t]$ for $t \in \mathbb{R}$. If we
order the samples as $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, then they split the real line into at most $n + 1$
intervals (including the two end-intervals $(-\infty, x_{(1)})$ and $[x_{(n)}, \infty)$). For a given $t$, the indicator
function $\mathbb{I}_{(-\infty, t]}$ takes the value one for all $x_{(i)} \le t$, and the value zero for all other samples.
Thus, we have shown that, for any given sample $x_1^n$, we have $\operatorname{card}(\mathcal{F}(x_1^n)) \le n + 1$. Applying
Lemma 4.14, we obtain

$$\mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|\Big] \le 4 \sqrt{\frac{\log(n + 1)}{n}},$$

and taking averages over the data $X_i$ yields the upper bound $\mathcal{R}_n(\mathcal{F}) \le 4 \sqrt{\frac{\log(n + 1)}{n}}$. The
claim (4.25) then follows from Theorem 4.10.


Although the exponential tail bound (4.25) is adequate for many purposes, it is far from
the tightest possible. Using alternative methods, we provide a sharper result that removes the
$\sqrt{\log(n + 1)}$ factor in Chapter 5. See the bibliographic section for references to the sharpest
possible results, including control of the constants in the exponent and the pre-factor.
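
The counting step in the proof of Corollary 4.15 is easy to verify by brute force. In the sketch below (function names are hypothetical), `half_interval_patterns` enumerates the distinct 0/1 vectors that half-intervals (−∞, t] produce on a given set of points; it is enough to try one threshold below all points and one threshold at each distinct point, since the pattern changes only when t crosses a sample value. The second function simply evaluates the deviation level that appears inside the probability in (4.25).

```python
import math

def half_interval_patterns(points):
    """Distinct vectors (1[x_1 <= t], ..., 1[x_n <= t]) as t ranges over R."""
    thresholds = [-math.inf] + sorted(set(points))
    return {tuple(1 if x <= t else 0 for x in points) for t in thresholds}

def glivenko_cantelli_level(n, delta):
    """The level 8 * sqrt(log(n + 1) / n) + delta from the corollary."""
    return 8 * math.sqrt(math.log(n + 1) / n) + delta
```

For n distinct points the first function returns exactly n + 1 patterns, and fewer when some points coincide, in agreement with the cardinality bound used in the proof.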


4.3.2 Vapnik—Chervonenkis dimension 


Thus far, we have seen that it is relatively straightforward to establish uniform laws for 
function classes with polynomial discrimination. In certain cases, such as in our proof of 
the classical Glivenko—Cantelli law, we can verify by direct calculation that a given function 
class has polynomial discrimination. More broadly, it is of interest to develop techniques 
for certifying this property in a less laborious manner. The theory of Vapnik—Chervonenkis 
(VC) dimension provides one such class of techniques. Accordingly, we now turn to defining 
the notions of shattering and VC dimension.

Let us consider a function class $\mathcal{F}$ in which each function $f$ is binary-valued, taking the
values $\{0, 1\}$ for concreteness. In this case, the set $\mathcal{F}(x_1^n)$ from equation (4.11) can have at
most $2^n$ elements.


Definition 4.16 (Shattering and VC dimension) Given a class $\mathcal{F}$ of binary-valued
functions, we say that the set $x_1^n = (x_1, \ldots, x_n)$ is shattered by $\mathcal{F}$ if $\operatorname{card}(\mathcal{F}(x_1^n)) = 2^n$.
The VC dimension $\nu(\mathcal{F})$ is the largest integer $n$ for which there is some collection
$x_1^n = (x_1, \ldots, x_n)$ of $n$ points that is shattered by $\mathcal{F}$.


When the quantity $\nu(\mathcal{F})$ is finite, then the function class $\mathcal{F}$ is said to be a VC class. We
will frequently consider function classes $\mathcal{F}$ that consist of indicator functions $\mathbb{I}_S[\cdot]$, for sets
$S$ ranging over some class of sets $\mathcal{S}$. In this case, we use $\mathcal{S}(x_1^n)$ and $\nu(\mathcal{S})$ as shorthands for
the sets $\mathcal{F}(x_1^n)$ and the VC dimension of $\mathcal{F}$, respectively. For a given set class $\mathcal{S}$, the shatter
coefficient of order $n$ is given by $\max_{x_1^n} \operatorname{card}(\mathcal{S}(x_1^n))$.


Let us illustrate the notions of shattering and VC dimension with some examples: 


Example 4.17 (Intervals in $\mathbb{R}$) Consider the class of all indicator functions for left-sided
half-intervals on the real line, namely, the class $\mathcal{S}_{\text{left}} := \{(-\infty, a] \mid a \in \mathbb{R}\}$. Implicit in the
proof of Corollary 4.15 is a calculation of the VC dimension for this class. We first note
that, for any single point $x_1$, both subsets ($\{x_1\}$ and the empty set $\emptyset$) can be picked out by
the class of left-sided intervals $\{(-\infty, a] \mid a \in \mathbb{R}\}$. But given two distinct points $x_1 < x_2$, it
is impossible to find a left-sided interval that contains $x_2$ but not $x_1$. Therefore, we conclude
that $\nu(\mathcal{S}_{\text{left}}) = 1$. In the proof of Corollary 4.15, we showed more specifically that, for any
collection $x_1^n = \{x_1, \ldots, x_n\}$, we have $\operatorname{card}(\mathcal{S}_{\text{left}}(x_1^n)) \le n + 1$.

Now consider the class of all two-sided intervals over the real line, namely, the class
$\mathcal{S}_{\text{two}} := \{(b, a] \mid a, b \in \mathbb{R} \text{ such that } b < a\}$. The class $\mathcal{S}_{\text{two}}$ can shatter any two-point set.
However, given three distinct points $x_1 < x_2 < x_3$, it cannot pick out the subset $\{x_1, x_3\}$, showing
that $\nu(\mathcal{S}_{\text{two}}) = 2$. For future reference, let us also upper bound the shatter coefficients of $\mathcal{S}_{\text{two}}$.
Note that any collection of $n$ distinct points $x_1 < x_2 < \cdots < x_{n-1} < x_n$ divides up the real line
into $(n + 1)$ intervals. Thus, any set of the form $(b, a]$ can be specified by choosing one of
$(n + 1)$ intervals for $b$, and a second interval for $a$. Thus, a crude upper bound on the shatter
coefficient of order $n$ is

$$\operatorname{card}(\mathcal{S}_{\text{two}}(x_1^n)) \le (n + 1)^2,$$

showing that this class has polynomial discrimination with degree $\nu = 2$. ♣
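
The shattering claims in Example 4.17 can be confirmed by exhaustive search on a small pool of points. The sketch below is a brute-force illustration with hypothetical helper names. It relies on the observation that, for interval classes, only finitely many representative endpoints need to be tried: the sample points themselves plus two values below the smallest point (the second extra value lets a two-sided interval pick out the empty set).

```python
from itertools import combinations

def patterns(points, members):
    """Distinct membership vectors on `points`, one per set predicate."""
    return {tuple(1 if m(x) else 0 for x in points) for m in members}

def left_intervals(points):
    """Representative left-sided intervals (-inf, a] for the given points."""
    cuts = [min(points) - 1] + sorted(points)
    return [lambda x, a=a: x <= a for a in cuts]

def two_sided_intervals(points):
    """Representative two-sided intervals (b, a] with b < a."""
    cuts = [min(points) - 2, min(points) - 1] + sorted(points)
    return [lambda x, b=b, a=a: b < x <= a
            for b in cuts for a in cuts if b < a]

def is_shattered(points, make_sets):
    """True if all 2^n subsets of `points` are picked out by the class."""
    return len(patterns(points, make_sets(points))) == 2 ** len(points)

def vc_dimension_on(pool, make_sets, max_size):
    """Largest k <= max_size such that some k-subset of `pool` is shattered."""
    best = 0
    for k in range(1, max_size + 1):
        if any(is_shattered(list(c), make_sets) for c in combinations(pool, k)):
            best = k
    return best
```

On a pool of four distinct points, the search reports dimension 1 for left-sided intervals and 2 for two-sided intervals. Since membership in an interval depends only on the ordering of the points, a small pool of distinct values is representative, although the search itself is only a finite check and not a proof.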


Thus far, we have seen two examples of function classes with finite VC dimension, both 
of which turned out also to have polynomial discrimination. Is there a general connection 
between the VC dimension and polynomial discriminability? Indeed, it turns out that any 
finite VC class has polynomial discrimination with degree at most the VC dimension; this 
fact is a deep result that was proved independently (in slightly different forms) in papers by 
Vapnik and Chervonenkis, Sauer and Shelah. 

In order to understand why this fact is surprising, note that, for a given set class $\mathcal{S}$, the
definition of VC dimension implies that, for all $n > \nu(\mathcal{S})$, it must be the case that
$\operatorname{card}(\mathcal{S}(x_1^n)) < 2^n$ for all collections $x_1^n$ of $n$ samples. However, at least in principle, there
could exist some subset with

$$\operatorname{card}(\mathcal{S}(x_1^n)) = 2^n - 1,$$

which is not significantly different from $2^n$. The following result shows that this is not the
case; indeed, for any VC class, the cardinality of $\mathcal{S}(x_1^n)$ can grow at most polynomially in $n$.


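Proposition 4.18 (Vapnik–Chervonenkis, Sauer and Shelah) Consider a set class $\mathcal{S}$
with $\nu(\mathcal{S}) < \infty$. Then for any collection of points $P = (x_1, \ldots, x_n)$ with $n \ge \nu(\mathcal{S})$, we
have

$$\operatorname{card}(\mathcal{S}(P)) \overset{(i)}{\le} \sum_{i=0}^{\nu(\mathcal{S})} \binom{n}{i} \overset{(ii)}{\le} (n + 1)^{\nu(\mathcal{S})}. \tag{4.26}$$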


Given inequality (i), inequality (ii) can be established by elementary combinatorial argu- 
ments, so we leave it to the reader (in particular, see part (a) of Exercise 4.11). Part (b) of 
the same exercise establishes a sharper upper bound. 
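
Both inequalities in (4.26) can be checked numerically on small random set classes. The sketch below is illustrative only: a set class is represented by its set of binary labelings of $n$ points, and `vc_dimension` and `sauer_bound` are hypothetical helper names, not notation from the text.

```python
from itertools import combinations
from math import comb
import random


def vc_dimension(labelings, n):
    """Largest m such that some m of the n points show all 2^m patterns."""
    best = 0
    for m in range(1, n + 1):
        shattered = any(
            len({tuple(lab[i] for i in idx) for lab in labelings}) == 2 ** m
            for idx in combinations(range(n), m)
        )
        if not shattered:
            # Subsets of shattered sets are shattered, so we can stop here.
            break
        best = m
    return best


def sauer_bound(n, v):
    """Middle term of (4.26): sum of binom(n, i) for i = 0, ..., v."""
    return sum(comb(n, i) for i in range(v + 1))


rng = random.Random(0)
n = 6
for _ in range(5):
    labelings = {tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(12)}
    v = vc_dimension(labelings, n)
    # Columns: card(S(P)), binomial sum, (n + 1)^v
    print(len(labelings), sauer_bound(n, v), (n + 1) ** v)
```

In every row the three printed numbers should be non-decreasing from left to right, in line with inequalities (i) and (ii).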


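Proof Given a subset of points $Q$ and a set class $\mathcal{T}$, we let $\nu(\mathcal{T}; Q)$ denote the VC dimension
of $\mathcal{T}$ when considering only whether or not subsets of $Q$ can be shattered. Note that
$\nu(\mathcal{T}) \le k$ implies that $\nu(\mathcal{T}; Q) \le k$ for all point sets $Q$. For positive integers $(n, k)$, define the
functions

$$\Phi_k(n) := \sup_{\substack{\text{point sets } Q \\ \operatorname{card}(Q) \le n}} \; \sup_{\substack{\text{set classes } \mathcal{T} \\ \nu(\mathcal{T}; Q) \le k}} \operatorname{card}(\mathcal{T}(Q)) \quad \text{and} \quad \Psi_k(n) := \sum_{i=0}^{k} \binom{n}{i}.$$

Here we agree that $\binom{n}{i} = 0$ whenever $i > n$. In terms of this notation, we claim that it suffices
to prove that

$$\Phi_k(n) \le \Psi_k(n). \tag{4.27}$$

Indeed, suppose there were some set class $\mathcal{S}$ with $\nu(\mathcal{S}) = k$ and collection $P = \{x_1, \ldots, x_n\}$
of $n$ distinct points for which $\operatorname{card}(\mathcal{S}(P)) > \Psi_k(n)$. By the definition of $\Phi_k(n)$, we would then
have

$$\Phi_k(n) \overset{(i)}{\ge} \sup_{\substack{\text{set classes } \mathcal{T} \\ \nu(\mathcal{T}; P) \le k}} \operatorname{card}(\mathcal{T}(P)) \overset{(ii)}{\ge} \operatorname{card}(\mathcal{S}(P)) > \Psi_k(n), \tag{4.28}$$

which contradicts the claim (4.27). Here inequality (i) follows because $P$ is feasible for the
supremum over $Q$ that defines $\Phi_k(n)$; and inequality (ii) follows because $\nu(\mathcal{S}) = k$ implies
that $\nu(\mathcal{S}; P) \le k$.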


We now prove the claim (4.27) by induction on the sum n + k of the pairs (n, k). 


Base case: To start, we claim that inequality (4.27) holds for all pairs with $n + k = 2$.
The claim is trivial if either $n = 0$ or $k = 0$. Otherwise, for $(n, k) = (1, 1)$, both sides of
inequality (4.27) are equal to 2.


Induction step: Now assume that, for some integer $\ell > 2$, the inequality (4.27) holds for
all pairs with $n + k < \ell$. We claim that it then holds for all pairs with $n + k = \ell$. Fix an
arbitrary pair $(n, k)$ such that $n + k = \ell$, a point set $P = \{x_1, \ldots, x_n\}$ and a set class $\mathcal{S}$ such
that $\nu(\mathcal{S}; P) = k$. Define the point set $P' = P \setminus \{x_1\}$, and let $\mathcal{S}_0 \subseteq \mathcal{S}$ be the smallest collection
of subsets that labels the point set $P'$ in the maximal number of different ways. Let $\mathcal{S}_1$ be the
smallest collection of subsets inside $\mathcal{S} \setminus \mathcal{S}_0$ that produce binary labelings of the point set $P$
that are not in $\mathcal{S}_0(P)$. (The choices of $\mathcal{S}_0$ and $\mathcal{S}_1$ need not be unique.)

As a concrete example, given a set class $\mathcal{S} = \{s_1, s_2, s_3, s_4\}$ and a point set $P = \{x_1, x_2, x_3\}$,
suppose that the sets generated the binary labelings

$$s_1 \leftrightarrow (0, 1, 1), \quad s_2 \leftrightarrow (1, 1, 1), \quad s_3 \leftrightarrow (0, 1, 0), \quad s_4 \leftrightarrow (0, 1, 1).$$

In this particular case, we have $\mathcal{S}(P) = \{(0, 1, 1), (1, 1, 1), (0, 1, 0)\}$, and one valid choice of
the pair $(\mathcal{S}_0, \mathcal{S}_1)$ would be $\mathcal{S}_0 = \{s_1, s_3\}$ and $\mathcal{S}_1 = \{s_2\}$, generating the labelings
$\mathcal{S}_0(P) = \{(0, 1, 1), (0, 1, 0)\}$ and $\mathcal{S}_1(P) = \{(1, 1, 1)\}$.

Using this decomposition, we claim that

$$\operatorname{card}(\mathcal{S}(P)) = \operatorname{card}(\mathcal{S}_0(P')) + \operatorname{card}(\mathcal{S}_1(P')).$$

Indeed, any binary labeling in $\mathcal{S}(P)$ is either mapped to a member of $\mathcal{S}_0(P')$, or in the case
that its labeling on $P'$ corresponds to a duplicate, it can be uniquely identified with a member
of $\mathcal{S}_1(P')$. This can be verified in the special case described above.

Now since $P'$ is a subset of $P$ and $\mathcal{S}_0$ is a subset of $\mathcal{S}$, we have

$$\nu(\mathcal{S}_0; P') \le \nu(\mathcal{S}_0; P) \le k.$$

Since the cardinality of $P'$ is equal to $n - 1$, the induction hypothesis thus implies that
$\operatorname{card}(\mathcal{S}_0(P')) \le \Psi_k(n - 1)$.

On the other hand, we claim that the set class $\mathcal{S}_1$ satisfies the upper bound $\nu(\mathcal{S}_1; P') \le k - 1$.
Suppose that $\mathcal{S}_1$ shatters some subset $Q' \subseteq P'$ of cardinality $m$; it suffices to show that
$m \le k - 1$. If $\mathcal{S}_1$ shatters such a set $Q'$, then $\mathcal{S}$ would shatter the set $Q = Q' \cup \{x_1\} \subseteq P$. (This
fact follows by construction of $\mathcal{S}_1$: for every binary vector in the set $\mathcal{S}_1(P)$, the set $\mathcal{S}(P)$
must contain a binary vector with the label for $x_1$ flipped; see the concrete example given
above for an illustration.) Since $\nu(\mathcal{S}; P) \le k$, it must be the case that $\operatorname{card}(Q) = m + 1 \le k$,
which implies that $\nu(\mathcal{S}_1; P') \le k - 1$. Consequently, the induction hypothesis implies that
$\operatorname{card}(\mathcal{S}_1(P')) \le \Psi_{k-1}(n - 1)$.

Putting together the pieces, we have shown that

$$\operatorname{card}(\mathcal{S}(P)) \le \Psi_k(n - 1) + \Psi_{k-1}(n - 1) \overset{(i)}{=} \Psi_k(n), \tag{4.29}$$

where the equality (i) follows from an elementary combinatorial argument (see Exercise 4.10).
This completes the proof.
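
The counting identity $\operatorname{card}(\mathcal{S}(P)) = \operatorname{card}(\mathcal{S}_0(P')) + \operatorname{card}(\mathcal{S}_1(P'))$ used in the induction step can be traced on the concrete four-set example from the proof. The sketch below works directly with labelings rather than with sets; it reads $\mathcal{S}_0(P')$ as the distinct labelings of $P'$ and $\mathcal{S}_1(P')$ as those labelings of $P'$ that occur with both values at $x_1$. This reading, and the function name `decompose`, are assumptions of the sketch.

```python
def decompose(labelings):
    """Split the labelings of P = (x_1, ..., x_n) as in the induction step."""
    full = set(labelings)  # S(P), with duplicate labelings merged
    # S_0(P'): distinct labelings once the x_1 coordinate is dropped.
    restricted = {lab[1:] for lab in full}
    # S_1(P'): labelings of P' that extend to P with x_1 = 0 and with x_1 = 1.
    doubled = {r for r in restricted if (0,) + r in full and (1,) + r in full}
    return full, restricted, doubled


# Labelings generated by s_1, s_2, s_3, s_4 in the concrete example.
example = [(0, 1, 1), (1, 1, 1), (0, 1, 0), (0, 1, 1)]
full, restricted, doubled = decompose(example)
print(len(full), "=", len(restricted), "+", len(doubled))
```

Each labeling of $P'$ comes from either one or two labelings of $P$, and the second term counts exactly the labelings that come from two.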


4.3.3 Controlling the VC dimension


Since classes with finite VC dimension have polynomial discrimination, it is of interest to 
develop techniques for controlling the VC dimension. 


Basic operations 


The property of having finite VC dimension is preserved under a number of basic operations, 
as summarized in the following. 


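Proposition 4.19 Let $\mathcal{S}$ and $\mathcal{T}$ be set classes, each with finite VC dimensions $\nu(\mathcal{S})$ and
$\nu(\mathcal{T})$, respectively. Then each of the following set classes also has finite VC dimension:


(a) The set class $\mathcal{S}^c := \{S^c \mid S \in \mathcal{S}\}$, where $S^c$ denotes the complement of $S$.
(b) The set class $\mathcal{S} \sqcup \mathcal{T} := \{S \cup T \mid S \in \mathcal{S},\ T \in \mathcal{T}\}$.
(c) The set class $\mathcal{S} \sqcap \mathcal{T} := \{S \cap T \mid S \in \mathcal{S},\ T \in \mathcal{T}\}$.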


We leave the proof of this result as an exercise for the reader (Exercise 4.8). 


Vector space structure 


Any class $\mathscr{G}$ of real-valued functions defines a class of sets by the operation of taking
subgraphs. In particular, given a real-valued function $g: \mathcal{X} \to \mathbb{R}$, its subgraph at level zero is
the subset $S_g := \{x \in \mathcal{X} \mid g(x) \le 0\}$. In this way, we can associate to $\mathscr{G}$ the collection of
subsets $\mathcal{S}(\mathscr{G}) := \{S_g,\ g \in \mathscr{G}\}$, which we refer to as the subgraph class of $\mathscr{G}$. Many interesting
classes of sets are naturally defined in this way, among them half-spaces, ellipsoids and so
on. In many cases, the underlying function class $\mathscr{G}$ is a vector space, and the following result
allows us to upper bound the VC dimension of the associated set class $\mathcal{S}(\mathscr{G})$.


Proposition 4.20 (Finite-dimensional vector spaces) Let $\mathscr{G}$ be a vector space of functions
$g: \mathbb{R}^d \to \mathbb{R}$ with dimension $\dim(\mathscr{G}) < \infty$. Then the subgraph class $\mathcal{S}(\mathscr{G})$ has VC
dimension at most $\dim(\mathscr{G})$.


Proof By the definition of VC dimension, we need to show that no collection of
$n = \dim(\mathscr{G}) + 1$ points in $\mathbb{R}^d$ can be shattered by $\mathcal{S}(\mathscr{G})$. Fix an arbitrary collection
$x_1^n = \{x_1, \ldots, x_n\}$ of $n$ points in $\mathbb{R}^d$, and consider the linear map $L: \mathscr{G} \to \mathbb{R}^n$ given by
$L(g) = (g(x_1), \ldots, g(x_n))$. By construction, the range of the mapping $L$ is a linear subspace of $\mathbb{R}^n$
with dimension at most $\dim(\mathscr{G}) = n - 1 < n$. Therefore, there must exist a non-zero vector
$\gamma \in \mathbb{R}^n$ such that $\langle \gamma, L(g) \rangle = 0$ for all $g \in \mathscr{G}$. We may assume without loss of generality that
at least one coordinate is positive, and then write

$$\sum_{\{i \,\mid\, \gamma_i \le 0\}} (-\gamma_i)\, g(x_i) = \sum_{\{i \,\mid\, \gamma_i > 0\}} \gamma_i\, g(x_i) \quad \text{for all } g \in \mathscr{G}. \tag{4.30}$$

Proceeding via proof by contradiction, suppose that there were to exist some $g \in \mathscr{G}$ such
that the associated subgraph set $S_g = \{x \in \mathbb{R}^d \mid g(x) \le 0\}$ included only the subset
$\{x_i \mid \gamma_i \le 0\}$. For such a function $g$, the right-hand side of equation (4.30) would be strictly positive
while the left-hand side would be non-positive, which is a contradiction. We conclude that
$\mathcal{S}(\mathscr{G})$ fails to shatter the set $\{x_1, \ldots, x_n\}$, as claimed.


Let us illustrate the use of Proposition 4.20 with some examples: 


Example 4.21 (Linear functions in $\mathbb{R}^d$) For a pair $(a, b) \in \mathbb{R}^d \times \mathbb{R}$, define the function
$f_{a,b}(x) := \langle a, x \rangle + b$, and consider the family $\mathscr{L}^d := \{f_{a,b} \mid (a, b) \in \mathbb{R}^d \times \mathbb{R}\}$ of all such
linear functions. The associated subgraph class $\mathcal{S}(\mathscr{L}^d)$ corresponds to the collection of all
half-spaces of the form $H_{a,b} := \{x \in \mathbb{R}^d \mid \langle a, x \rangle + b \le 0\}$. Since the family $\mathscr{L}^d$ forms a vector
space of dimension $d + 1$, we obtain as an immediate consequence of Proposition 4.20 that
$\mathcal{S}(\mathscr{L}^d)$ has VC dimension at most $d + 1$.

For the special case $d = 1$, let us verify this statement by a more direct calculation.
In this case, the class $\mathcal{S}(\mathscr{L}^1)$ corresponds to the collection of all left-sided or right-sided
intervals, that is,

$$\mathcal{S}(\mathscr{L}^1) = \{(-\infty, t] \mid t \in \mathbb{R}\} \cup \{[t, \infty) \mid t \in \mathbb{R}\}.$$

Given any two distinct points $x_1 < x_2$, the collection of all such intervals can pick out
all possible subsets. However, given any three points $x_1 < x_2 < x_3$, there is no interval
contained in $\mathcal{S}(\mathscr{L}^1)$ that contains $x_2$ while excluding both $x_1$ and $x_3$. This calculation shows
that $\nu(\mathcal{S}(\mathscr{L}^1)) = 2$, which matches the upper bound obtained from Proposition 4.20. More
generally, it can be shown that the VC dimension of $\mathcal{S}(\mathscr{L}^d)$ is $d + 1$, so that Proposition 4.20
yields a sharp result in all dimensions. ♣
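
For $d = 1$, the null-vector argument behind Proposition 4.20 can be traced by hand on three points. The sketch below is only an illustration: it takes $\gamma$ to be the cross product of the vectors $(x_1, x_2, x_3)$ and $(1, 1, 1)$, which is one convenient way to obtain a vector orthogonal to every $(a x_i + b)_{i=1}^3$, and it enumerates half-line labelings using a few thresholds. Both helper names are hypothetical.

```python
def null_vector(xs):
    """Non-zero gamma with sum_i gamma_i * (a * x_i + b) = 0 for all (a, b)."""
    x1, x2, x3 = xs
    # Cross product of (x1, x2, x3) with (1, 1, 1).
    return (x2 - x3, x3 - x1, x1 - x2)


def halfline_labelings(xs):
    """Labelings of xs by the sets {x : a * x + b <= 0}."""
    pts = sorted(xs)
    cuts = [pts[0] - 1.0] + pts + [pts[-1] + 1.0]
    labelings = set()
    for t in cuts:
        labelings.add(tuple(int(x <= t) for x in xs))  # case a > 0
        labelings.add(tuple(int(x >= t) for x in xs))  # case a < 0
    return labelings  # a = 0 gives all or none, already included


xs = (0.0, 1.0, 3.0)
gamma = null_vector(xs)
# Subset {x_i : gamma_i <= 0}, which the proof shows cannot be picked out.
forbidden = tuple(int(g <= 0) for g in gamma)
print(gamma, forbidden, forbidden in halfline_labelings(xs))
```

Applying the same argument to $-\gamma$ rules out the complementary pattern, which here is the middle point alone, as in the direct calculation above.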


Example 4.22 (Spheres in $\mathbb{R}^d$) Consider the sphere $S_{a,b} := \{x \in \mathbb{R}^d \mid \|x - a\|_2 \le b\}$,
where $(a, b) \in \mathbb{R}^d \times \mathbb{R}_+$ specify its center and radius, respectively, and let $\mathcal{S}^d_{\text{sphere}}$ denote the
collection of all such spheres. If we define the function

$$f_{a,b}(x) := \|x\|_2^2 - 2 \sum_{j=1}^{d} a_j x_j + \|a\|_2^2 - b^2,$$

then we have $S_{a,b} = \{x \in \mathbb{R}^d \mid f_{a,b}(x) \le 0\}$, so that the sphere $S_{a,b}$ is a subgraph of the
function $f_{a,b}$.

In order to leverage Proposition 4.20, we first define a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^{d+2}$ via
$\phi(x) := (1, x_1, \ldots, x_d, \|x\|_2^2)$, and then consider functions of the form

$$g_c(x) := \langle c, \phi(x) \rangle \quad \text{where } c \in \mathbb{R}^{d+2}.$$

The family of functions $\{g_c,\ c \in \mathbb{R}^{d+2}\}$ is a vector space of dimension $d + 2$, and it contains
the function class $\{f_{a,b},\ (a, b) \in \mathbb{R}^d \times \mathbb{R}_+\}$. Consequently, by applying Proposition 4.20 to this
larger vector space, we conclude that $\nu(\mathcal{S}^d_{\text{sphere}}) \le d + 2$. This bound is adequate for many
purposes, but is not sharp: a more careful analysis shows that the VC dimension of spheres
in $\mathbb{R}^d$ is actually $d + 1$. See Exercise 4.13 for an in-depth exploration of the case $d = 2$. ♣


4.4 Bibliographic details and background


First, a technical remark regarding measurability: in general, the normed difference
$\|\mathbb{P}_n - \mathbb{P}\|_{\mathscr{F}}$ need not be measurable, since the function class $\mathscr{F}$ may contain an uncountable
number of elements. If the function class is separable, then we may simply take the
supremum over the countable dense basis. Otherwise, for a general function class, there are
various ways of dealing with the issue of measurability, including the use of outer probability
(cf. van der Vaart and Wellner (1996)). Here we instead adopt the following convention,
suitable for defining expectations of any function $\phi$ of $\|\mathbb{P}_n - \mathbb{P}\|_{\mathscr{F}}$. For any finite class of
functions $\mathscr{G}$ contained within $\mathscr{F}$, the random variable $\|\mathbb{P}_n - \mathbb{P}\|_{\mathscr{G}}$ is well defined, so that it
is sensible to define

$$\mathbb{E}[\phi(\|\mathbb{P}_n - \mathbb{P}\|_{\mathscr{F}})] := \sup\{\mathbb{E}[\phi(\|\mathbb{P}_n - \mathbb{P}\|_{\mathscr{G}})] \mid \mathscr{G} \subset \mathscr{F},\ \mathscr{G} \text{ has finite cardinality}\}.$$


By using this definition, we can always think instead of expectations defined via suprema 
over finite sets. 

Theorem 4.4 was originally proved by Glivenko (1933) for the continuous case, and by
Cantelli (1933) in the general setting. The non-asymptotic form of the Glivenko–Cantelli
theorem given in Corollary 4.15 can be sharpened substantially. For instance, Dvoretzky,
Kiefer and Wolfowitz (1956) prove that there is a constant $C$ independent of $F$ and $n$ such
that

$$\mathbb{P}\big[\|\widehat{F}_n - F\|_\infty \ge \delta\big] \le C e^{-2n\delta^2} \quad \text{for all } \delta \ge 0. \tag{4.31}$$

Massart (1990) establishes the sharpest possible result, with the leading constant $C = 2$.

The Rademacher complexity, and its relative the Gaussian complexity, have a lengthy 
history in the study of Banach spaces using probabilistic methods; for instance, see the 
books (Milman and Schechtman, 1986; Pisier, 1989; Ledoux and Talagrand, 1991). Rade- 
macher and Gaussian complexities have also been studied extensively in the specific context 
of uniform laws of large numbers and empirical risk minimization (e.g. van der Vaart and 
Wellner, 1996; Koltchinskii and Panchenko, 2000; Koltchinskii, 2001, 2006; Bartlett and 
Mendelson, 2002; Bartlett et al., 2005). In Chapter 5, we develop further connections be- 
tween these two forms of complexity, and the related notion of metric entropy. 

Exercise 5.4 is adapted from Problem 2.6.3 from van der Vaart and Wellner (1996). The 
proof of Proposition 4.20 is adapted from Pollard (1984), who credits it to Steele (1978) and 
Dudley (1978). 


4.5 Exercises 


Exercise 4.1 (Continuity of functionals) Recall that the functional $\gamma$ is continuous in the
sup-norm at $F$ if for all $\epsilon > 0$, there exists a $\delta > 0$ such that $\|G - F\|_\infty \le \delta$ implies that
$|\gamma(G) - \gamma(F)| \le \epsilon$.


(a) Given $n$ i.i.d. samples with law specified by $F$, let $\widehat{F}_n$ be the empirical CDF. Show that
if $\gamma$ is continuous in the sup-norm at $F$, then $\gamma(\widehat{F}_n) \xrightarrow{\text{prob.}} \gamma(F)$.
(b) Which of the following functionals are continuous with respect to the sup-norm? Prove
or disprove.
(i) The mean functional $F \mapsto \int x \, dF(x)$.
(ii) The Cramér–von Mises functional $F \mapsto \int [F(x) - F_0(x)]^2 \, dF_0(x)$.
(iii) The quantile functional $Q_\alpha(F) = \inf\{t \in \mathbb{R} \mid F(t) \ge \alpha\}$.


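Exercise 4.2 (Failure of Glivenko–Cantelli) Recall from Example 4.7 the class $\mathcal{S}$ of all
subsets $S$ of $[0, 1]$ for which $S$ has a finite number of elements. Prove that the Rademacher
complexity satisfies the lower bound

$$\mathcal{R}_n(\mathcal{S}) = \mathbb{E}_{X,\varepsilon}\left[\sup_{S \in \mathcal{S}} \left|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \mathbb{I}_S[X_i]\right|\right] \ge \frac{1}{2}. \tag{4.32}$$

Discuss the connection to Theorem 4.10.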


Exercise 4.3 (Maximum likelihood and uniform laws) Recall from Example 4.8 our discussion
of empirical and population risks for maximum likelihood over a family of densities
$\{p_\theta,\ \theta \in \Omega\}$.


(a) Compute the population risk $R(\theta, \theta^*) = \mathbb{E}_{\theta^*}\left[\log \frac{p_{\theta^*}(X)}{p_\theta(X)}\right]$ in the following cases:

(i) Bernoulli: $p_\theta(x) = \frac{e^{\theta x}}{1 + e^{\theta}}$ for $x \in \{0, 1\}$;

(ii) Poisson: $p_\theta(x) = \frac{e^{\theta x} e^{-\exp(\theta)}}{x!}$ for $x \in \{0, 1, 2, \ldots\}$;

(iii) multivariate Gaussian: $p_\theta$ is the density of an $N(\theta, \Sigma)$ vector, where the covariance
matrix $\Sigma$ is known and fixed.


(b) For each of the above cases:

(i) Letting $\widehat{\theta}$ denote the maximum likelihood estimate, give an explicit expression for
the excess risk $E(\widehat{\theta}, \theta^*) = R(\widehat{\theta}, \theta^*) - \inf_{\theta \in \Omega} R(\theta, \theta^*)$.

(ii) Give an upper bound on the excess risk in terms of an appropriate Rademacher
complexity.


Exercise 4.4 (Details of symmetrization argument) 


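(a) Prove that

$$\sup_{g \in \mathscr{G}} \mathbb{E}[g(X)] \le \mathbb{E}\Big[\sup_{g \in \mathscr{G}} |g(X)|\Big].$$

Use this to complete the proof of inequality (4.17).
(b) Prove that for any convex and non-decreasing function $\Phi$,

$$\sup_{g \in \mathscr{G}} \Phi\big(\mathbb{E}[|g(X)|]\big) \le \mathbb{E}\Big[\Phi\Big(\sup_{g \in \mathscr{G}} |g(X)|\Big)\Big].$$

Use this bound to complete the proof of Proposition 4.11.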


Exercise 4.5 (Necessity of vanishing Rademacher complexity) In this exercise, we work
through the proof of Proposition 4.12.

(a) Recall the recentered function class $\overline{\mathscr{F}} = \{f - \mathbb{E}[f] \mid f \in \mathscr{F}\}$. Show that

$$\mathbb{E}_{X,\varepsilon}\big[\|\mathbb{S}_n\|_{\overline{\mathscr{F}}}\big] \ge \mathbb{E}_{X,\varepsilon}\big[\|\mathbb{S}_n\|_{\mathscr{F}}\big] - \frac{\sup_{f \in \mathscr{F}} |\mathbb{E}[f]|}{\sqrt{n}}.$$

(b) Use concentration results to complete the proof of Proposition 4.12.


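Exercise 4.6 (Too many linear classifiers) Consider the function class

$$\mathscr{F} = \{x \mapsto \operatorname{sign}(\langle \theta, x \rangle) \mid \theta \in \mathbb{R}^d,\ \|\theta\|_2 = 1\},$$

corresponding to the $\{-1, +1\}$-valued classification rules defined by linear functions in $\mathbb{R}^d$.
Supposing that $d \ge n$, let $x_1^n = \{x_1, \ldots, x_n\}$ be a collection of vectors in $\mathbb{R}^d$ that are linearly
independent. Show that the empirical Rademacher complexity satisfies

$$\mathcal{R}(\mathscr{F}(x_1^n)/n) = \mathbb{E}_\varepsilon\left[\sup_{f \in \mathscr{F}} \left|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i)\right|\right] = 1.$$

Discuss the consequences for empirical risk minimization over the class $\mathscr{F}$.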


Exercise 4.7 (Basic properties of Rademacher complexity) Prove the following properties 
of the Rademacher complexity. 


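(a) $\mathcal{R}_n(\mathscr{F}) = \mathcal{R}_n(\operatorname{conv}(\mathscr{F}))$.

(b) Show that $\mathcal{R}_n(\mathscr{F} + \mathscr{G}) \le \mathcal{R}_n(\mathscr{F}) + \mathcal{R}_n(\mathscr{G})$. Give an example to demonstrate that this
bound cannot be improved in general.

(c) Given a fixed and uniformly bounded function $g$, show that

$$\mathcal{R}_n(\mathscr{F} + g) \le \mathcal{R}_n(\mathscr{F}) + \frac{\|g\|_\infty}{\sqrt{n}}. \tag{4.33}$$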


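Exercise 4.8 (Operations on VC classes) Let $\mathcal{S}$ and $\mathcal{T}$ be two classes of sets with finite
VC dimensions. Show that each of the following operations leads to a new set class also with
finite VC dimension.


(a) The set class $\mathcal{S}^c := \{S^c \mid S \in \mathcal{S}\}$, where $S^c$ denotes the complement of the set $S$.
(b) The set class $\mathcal{S} \sqcap \mathcal{T} := \{S \cap T \mid S \in \mathcal{S},\ T \in \mathcal{T}\}$.
(c) The set class $\mathcal{S} \sqcup \mathcal{T} := \{S \cup T \mid S \in \mathcal{S},\ T \in \mathcal{T}\}$.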


Exercise 4.9 Prove Lemma 4.14.


Exercise 4.10 Prove equality (i) in equation (4.29), namely that

$$\binom{n-1}{k} + \binom{n-1}{k-1} = \binom{n}{k}.$$


Exercise 4.11 In this exercise, we complete the proof of Proposition 4.18.


(a) Prove inequality (ii) in (4.26).
(b) For $n \ge \nu$, prove the sharper upper bound $\operatorname{card}(\mathcal{S}(x_1^n)) \le \left(\frac{en}{\nu}\right)^{\nu}$. (Hint: You might find the
result of Exercise 2.9 useful.)


Exercise 4.12 (VC dimension of left-sided intervals) Consider the class of left-sided
half-intervals in $\mathbb{R}^d$:

$$\mathcal{S}^d_{\text{left}} := \{(-\infty, t_1] \times (-\infty, t_2] \times \cdots \times (-\infty, t_d] \mid (t_1, \ldots, t_d) \in \mathbb{R}^d\}.$$

Show that for any collection of $n$ points, we have $\operatorname{card}(\mathcal{S}^d_{\text{left}}(x_1^n)) \le (n + 1)^d$ and $\nu(\mathcal{S}^d_{\text{left}}) = d$.


Exercise 4.13 (VC dimension of spheres) Consider the class of all spheres in $\mathbb{R}^2$, that is

$$\mathcal{S}^2_{\text{sphere}} := \{S_{a,b},\ (a, b) \in \mathbb{R}^2 \times \mathbb{R}_+\}, \tag{4.34}$$

where $S_{a,b} := \{x \in \mathbb{R}^2 \mid \|x - a\|_2 \le b\}$ is the sphere of radius $b \ge 0$ centered at $a = (a_1, a_2)$.


(a) Show that $\mathcal{S}^2_{\text{sphere}}$ can shatter any subset of three points that are not collinear.
(b) Show that no subset of four points can be shattered, and conclude that the VC dimension
is $\nu(\mathcal{S}^2_{\text{sphere}}) = 3$.


Exercise 4.14 (VC dimension of monotone Boolean conjunctions) For a positive integer
$d \ge 2$, consider the function $h_S : \{0, 1\}^d \to \{0, 1\}$ of the form

$$h_S(x_1, \ldots, x_d) = \begin{cases} 1 & \text{if } x_j = 1 \text{ for all } j \in S, \\ 0 & \text{otherwise.} \end{cases}$$

The set of all Boolean monomials $\mathscr{B}_d$ consists of all such functions as $S$ ranges over all
subsets of $\{1, 2, \ldots, d\}$, along with the constant functions $h \equiv 0$ and $h \equiv 1$. Show that the VC
dimension of $\mathscr{B}_d$ is equal to $d$.


Exercise 4.15 (VC dimension of closed and convex sets) Show that the class $\mathcal{C}^d_{\text{cc}}$ of all
closed and convex sets in $\mathbb{R}^d$ does not have finite VC dimension. (Hint: Consider a set of $n$
points on the boundary of the unit ball.)


Exercise 4.16 (VC dimension of polygons) Compute the VC dimension of the set of all
polygons in $\mathbb{R}^2$ with at most four vertices.


Exercise 4.17 (Infinite VC dimension) For a scalar $t \in \mathbb{R}$, consider the function
$f_t(x) = \operatorname{sign}(\sin(tx))$. Prove that the function class $\{f_t : [-1, 1] \to \mathbb{R} \mid t \in \mathbb{R}\}$ has infinite VC
dimension. (Note: This shows that VC dimension is not equivalent to the number of parameters in
a function class.)


5 


Metric entropy and its uses 


Many statistical problems require manipulating and controlling collections of random vari- 
ables indexed by sets with an infinite number of elements. There are many examples of such 
stochastic processes. For instance, a continuous-time random walk can be viewed as a col- 
lection of random variables indexed by the unit interval [0, 1]. Other stochastic processes, 
such as those involved in random matrix theory, are indexed by vectors that lie on the
Euclidean sphere. Empirical process theory, a broad area that includes the Glivenko–Cantelli
laws discussed in Chapter 4, is concerned with stochastic processes that are indexed by sets 
of functions. 

Whereas any finite set can be measured in terms of its cardinality, measuring the “size” of 
a set with infinitely many elements requires more delicacy. The concept of metric entropy, 
which dates back to the seminal work of Kolmogorov, Tikhomirov and others in the Rus- 
sian school, provides one way in which to address this difficulty. Though defined in a purely 
deterministic manner, in terms of packing and covering in a metric space, it plays a central 
role in understanding the behavior of stochastic processes. Accordingly, this chapter is de- 
voted to an exploration of metric entropy, and its various uses in the context of stochastic 
processes. 


5.1 Covering and packing 


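We begin by defining the notions of packing and covering a set in a metric space. Recall that
a metric space $(\mathbb{T}, \rho)$ consists of a non-empty set $\mathbb{T}$, equipped with a mapping $\rho: \mathbb{T} \times \mathbb{T} \to \mathbb{R}$
that satisfies the following properties:


(a) It is non-negative: $\rho(\theta, \widetilde{\theta}) \ge 0$ for all pairs $(\theta, \widetilde{\theta})$, with equality if and only if $\theta = \widetilde{\theta}$.
(b) It is symmetric: $\rho(\theta, \widetilde{\theta}) = \rho(\widetilde{\theta}, \theta)$ for all pairs $(\theta, \widetilde{\theta})$.
(c) The triangle inequality holds: $\rho(\theta, \widetilde{\theta}) \le \rho(\theta, \bar{\theta}) + \rho(\bar{\theta}, \widetilde{\theta})$ for all triples $(\theta, \widetilde{\theta}, \bar{\theta})$.


Familiar examples of metric spaces include the real space $\mathbb{R}^d$ with the Euclidean metric

$$\rho(\theta, \widetilde{\theta}) = \|\theta - \widetilde{\theta}\|_2 := \sqrt{\sum_{j=1}^{d} (\theta_j - \widetilde{\theta}_j)^2}, \tag{5.1a}$$

and the discrete cube $\{0, 1\}^d$ with the rescaled Hamming metric

$$\rho_H(\theta, \widetilde{\theta}) := \frac{1}{d} \sum_{j=1}^{d} \mathbb{I}[\theta_j \ne \widetilde{\theta}_j]. \tag{5.1b}$$

Also of interest are various metric spaces of functions, among them the usual spaces
$L^2(\mu, [0, 1])$ with its metric

$$\|f - g\|_2 := \left[\int_0^1 (f(x) - g(x))^2 \, d\mu(x)\right]^{1/2}, \tag{5.1c}$$

as well as the space $C[0, 1]$ of all continuous functions on $[0, 1]$ equipped with the sup-norm
metric

$$\|f - g\|_\infty = \sup_{x \in [0, 1]} |f(x) - g(x)|. \tag{5.1d}$$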


Given a metric space $(\mathbb{T}, \rho)$, a natural way in which to measure its size is in terms of the
number of balls of a fixed radius $\delta$ required to cover it, a quantity known as the covering number.


Definition 5.1 (Covering number) A $\delta$-cover of a set $\mathbb{T}$ with respect to a metric $\rho$ is
a set $\{\theta^1, \ldots, \theta^N\} \subset \mathbb{T}$ such that for each $\theta \in \mathbb{T}$, there exists some $i \in \{1, \ldots, N\}$ such
that $\rho(\theta, \theta^i) \le \delta$. The $\delta$-covering number $N(\delta; \mathbb{T}, \rho)$ is the cardinality of the smallest
$\delta$-cover.


Figure 5.1 Illustration of packing and covering sets. (a) A $\delta$-covering of $\mathbb{T}$ is a collection
of elements $\{\theta^1, \ldots, \theta^N\} \subset \mathbb{T}$ such that for each $\theta \in \mathbb{T}$, there is some element
$j \in \{1, \ldots, N\}$ such that $\rho(\theta, \theta^j) \le \delta$. Geometrically, the union of the balls with centers
$\theta^j$ and radius $\delta$ cover the set $\mathbb{T}$. (b) A $\delta$-packing of a set $\mathbb{T}$ is a collection of
elements $\{\theta^1, \ldots, \theta^M\} \subset \mathbb{T}$ such that $\rho(\theta^j, \theta^k) > \delta$ for all $j \ne k$. Geometrically, it is a
collection of balls of radius $\delta/2$ with centers contained in $\mathbb{T}$ such that no pair of balls
have a non-empty intersection.

As illustrated in Figure 5.1(a), a $\delta$-covering can be visualized as a collection of balls of
radius $\delta$ that cover the set $\mathbb{T}$. When discussing metric entropy, we restrict our attention to metric
spaces $(\mathbb{T}, \rho)$ that are totally bounded, meaning that the covering number $N(\delta) = N(\delta; \mathbb{T}, \rho)$
is finite for all $\delta > 0$. See Exercise 5.1 for an example of a metric space that is not totally
bounded.

It is easy to see that the covering number is non-increasing in $\delta$, meaning that $N(\delta) \ge N(\delta')$
for all $\delta \le \delta'$. Typically, the covering number diverges as $\delta \to 0^+$, and of interest to us is this
growth rate on a logarithmic scale. More specifically, the quantity $\log N(\delta; \mathbb{T}, \rho)$ is known
as the metric entropy of the set $\mathbb{T}$ with respect to $\rho$.


Example 5.2 (Covering numbers of unit cubes) Let us begin with a simple example of
how covering numbers can be bounded. Consider the interval $[-1, 1]$ in $\mathbb{R}$, equipped with
the metric $\rho(\theta, \theta') = |\theta - \theta'|$. Suppose that we divide the interval $[-1, 1]$ into $L := \lfloor \frac{1}{\delta} \rfloor + 1$
sub-intervals,¹ centered at the points $\theta^i = -1 + 2(i - 1)\delta$ for $i \in [L] := \{1, 2, \ldots, L\}$, and each
of length at most $2\delta$. By construction, for any point $\widetilde{\theta} \in [0, 1]$, there is some $j \in [L]$ such
that $|\theta^j - \widetilde{\theta}| \le \delta$, which shows that

$$N(\delta; [-1, 1], |\cdot|) \le \frac{1}{\delta} + 1. \tag{5.2}$$

As an exercise, the reader should generalize this analysis, showing that, for the $d$-dimensional
cube $[-1, 1]^d$, we have $N(\delta; [-1, 1]^d, \|\cdot\|_\infty) \le \left(1 + \frac{1}{\delta}\right)^d$. ♣
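
The bound (5.2) can be checked numerically with an explicit set of centers. The sketch below is a variant chosen for illustration, not the construction stated in the example: it places the $i$-th center at the midpoint $-1 + (2i - 1)\delta$ of the $i$-th sub-interval and clips it to $[-1, 1]$. The midpoint placement, the clipping, and the function names are all assumptions of this sketch.

```python
import math


def interval_cover(delta):
    """L = floor(1/delta) + 1 centers in [-1, 1], spaced 2 * delta apart."""
    L = math.floor(1.0 / delta) + 1
    # Midpoints of consecutive sub-intervals of length 2 * delta, clipped so
    # that every center stays inside the interval being covered.
    return [min(1.0, -1.0 + (2 * i - 1) * delta) for i in range(1, L + 1)]


def worst_distance(centers, grid_size=2001):
    """Largest distance from a grid point of [-1, 1] to its nearest center."""
    grid = [-1.0 + 2.0 * k / (grid_size - 1) for k in range(grid_size)]
    return max(min(abs(t - c) for c in centers) for t in grid)


for delta in (0.5, 0.35, 0.1):
    centers = interval_cover(delta)
    print(delta, len(centers), 1.0 / delta + 1.0, worst_distance(centers))
```

For each $\delta$, the number of centers should not exceed $1/\delta + 1$, and the worst-case distance should not exceed $\delta$ up to floating-point error.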


Example 5.3 (Covering of the binary hypercube) Consider the binary hypercube
$\mathbb{H}^d := \{0, 1\}^d$ equipped with the rescaled Hamming metric (5.1b). First, let us upper bound its
$\delta$-covering number. Let $S = \{1, 2, \ldots, \lceil (1 - \delta)d \rceil\}$, where $\lceil (1 - \delta)d \rceil$ denotes the smallest integer
larger than or equal to $(1 - \delta)d$. Consider the set of binary vectors

$$\mathbb{T}(\delta) := \{\theta \in \mathbb{H}^d \mid \theta_j = 0 \text{ for all } j \notin S\}.$$

By construction, for any binary vector $\widetilde{\theta} \in \mathbb{H}^d$, we can find a vector $\theta \in \mathbb{T}(\delta)$ such that
$\rho_H(\theta, \widetilde{\theta}) \le \delta$. (Indeed, we can match $\widetilde{\theta}$ exactly on all entries $j \in S$, and, in the worst case,
disagree on all the remaining $\lfloor \delta d \rfloor$ positions.) Since $\mathbb{T}(\delta)$ contains $2^{\lceil (1 - \delta)d \rceil}$ vectors, we
conclude that

$$\frac{\log N_H(\delta; \mathbb{H}^d)}{\log 2} \le \lceil d(1 - \delta) \rceil.$$

This bound is useful but can be sharpened considerably by using a more refined argument,
as discussed in Exercise 5.3.

Let us lower bound its $\delta$-covering number, where $\delta \in (0, \frac{1}{2})$. If $\{\theta^1, \ldots, \theta^N\}$ is a $\delta$-covering,
then the (unrescaled) Hamming balls of radius $s = \delta d$ around each $\theta^\ell$ must contain all $2^d$
vectors in the binary hypercube. Let $s = \lfloor \delta d \rfloor$ denote the largest integer less than or equal
to $\delta d$. For each $\theta^\ell$, there are exactly $\sum_{j=0}^{s} \binom{d}{j}$ binary vectors lying within distance $\delta d$ from it,
and hence we must have $N \left\{\sum_{j=0}^{s} \binom{d}{j}\right\} \ge 2^d$. Now let $X_i \in \{0, 1\}$ be i.i.d. Bernoulli variables
with parameter $1/2$. Rearranging the previous inequality, we have

$$\frac{1}{N} \le \sum_{j=0}^{s} \binom{d}{j} 2^{-d} = \mathbb{P}\left[\sum_{i=1}^{d} X_i \le \delta d\right] \overset{(i)}{\le} e^{-2d\left(\frac{1}{2} - \delta\right)^2},$$

where inequality (i) follows by applying Hoeffding's bound to the sum of $d$ i.i.d. Bernoulli
variables. Following some algebra, we obtain the lower bound

$$\log N_H(\delta; \mathbb{H}^d) \ge 2d\left(\frac{1}{2} - \delta\right)^2, \quad \text{valid for } \delta \in (0, \tfrac{1}{2}).$$

This lower bound is qualitatively correct, but can be tightened by using a better upper
bound on the binomial tail probability. For instance, from the result of Exercise 2.9, we
have $\frac{1}{d} \log \mathbb{P}\left[\sum_{i=1}^{d} X_i \le s\right] \le -D(\delta \,\|\, \frac{1}{2})$, where $D(\delta \,\|\, \frac{1}{2})$ is the Kullback–Leibler divergence
between the Bernoulli distributions with parameters $\delta$ and $\frac{1}{2}$, respectively. Using this tail
bound within the same argument leads to the improved lower bound

$$\log N_H(\delta; \mathbb{H}^d) \ge d\, D(\delta \,\|\, \tfrac{1}{2}), \quad \text{valid for } \delta \in (0, \tfrac{1}{2}). \tag{5.3}$$
♣

¹ For a scalar $a \in \mathbb{R}$, the notation $\lfloor a \rfloor$ denotes the greatest integer less than or equal to $a$.
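
To see how the bounds of Example 5.3 compare for concrete values of $d$ and $\delta$, the following sketch evaluates the exact counting bound $\log(2^d / \sum_{j \le s} \binom{d}{j})$, the Hoeffding-based bound, the Kullback–Leibler bound (5.3), and the upper bound $\lceil (1 - \delta)d \rceil \log 2$. All logarithms are natural; the function name and the dictionary keys are hypothetical.

```python
from math import ceil, comb, floor, log


def hypercube_entropy_bounds(d, delta):
    """Bounds on log N_H(delta; H^d) from Example 5.3, for delta in (0, 1/2)."""
    s = floor(delta * d)
    ball = sum(comb(d, j) for j in range(s + 1))  # size of a Hamming ball
    kl = delta * log(2 * delta) + (1 - delta) * log(2 * (1 - delta))
    return {
        "hoeffding_lower": 2 * d * (0.5 - delta) ** 2,
        "kl_lower": d * kl,
        "counting_lower": d * log(2) - log(ball),
        "upper": ceil((1 - delta) * d) * log(2),
    }


for d, delta in [(50, 0.2), (100, 0.25), (100, 0.4)]:
    print(d, delta, hypercube_entropy_bounds(d, delta))
```

The four values should appear in increasing order, since the Hoeffding and Kullback–Leibler bounds are successive relaxations of the counting bound, which is itself a lower bound on the metric entropy.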


In the preceding examples, we used different techniques to upper and lower bound the 
covering number. A complementary way in which to measure the massiveness of sets, also 
useful for deriving bounds on the metric entropy, is known as the packing number. 


Definition 5.4 (Packing number) A $\delta$-packing of a set $\mathbb{T}$ with respect to a metric $\rho$
is a set $\{\theta^1, \ldots, \theta^M\} \subset \mathbb{T}$ such that $\rho(\theta^i, \theta^j) > \delta$ for all distinct $i, j \in \{1, 2, \ldots, M\}$. The
$\delta$-packing number $M(\delta; \mathbb{T}, \rho)$ is the cardinality of the largest $\delta$-packing.


As illustrated in Figure 5.1(b), a $\delta$-packing can be viewed as a collection of balls of
radius $\delta/2$, each centered at an element contained in $\mathbb{T}$, such that no two balls intersect. What
is the relation between the covering number and packing numbers? Although not identical,
they provide essentially the same measure of the massiveness of a set, as summarized in the
following:


Lemma 5.5 For all $\delta > 0$, the packing and covering numbers are related as follows:

$$M(2\delta; \mathbb{T}, \rho) \overset{(a)}{\le} N(\delta; \mathbb{T}, \rho) \overset{(b)}{\le} M(\delta; \mathbb{T}, \rho). \tag{5.4}$$


We leave the proof of Lemma 5.5 for the reader (see Exercise 5.2). It shows that, at least up
to constant factors, the packing and covering numbers exhibit the same scaling behavior as
$\delta \to 0$.
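
The idea behind inequality (b) in (5.4) is that a packing which cannot be enlarged must already be a covering. The sketch below illustrates this on a finite point cloud standing in for $\mathbb{T}$: a greedy pass keeps a point only if it is more than $\delta$ away from every point kept so far. The finite cloud, the Euclidean metric, and the function names are assumptions made for illustration.

```python
import random


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def greedy_packing(points, delta, dist):
    """Greedily build a delta-packing of `points` that cannot be enlarged."""
    centers = []
    for p in points:
        # Keep p only if it is more than delta away from every kept center.
        if all(dist(p, c) > delta for c in centers):
            centers.append(p)
    return centers


rng = random.Random(0)
cloud = [(rng.random(), rng.random()) for _ in range(300)]
delta = 0.2
centers = greedy_packing(cloud, delta, euclidean)
worst = max(min(euclidean(p, c) for c in centers) for p in cloud)
print(len(centers), worst <= delta)
```

Every rejected point lies within $\delta$ of some kept center, so the kept centers form a $\delta$-cover of the cloud whose cardinality is at most the packing number.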


Example 5.6 (Packing of unit cubes) Returning to Example 5.2, we observe that the points
$\{\theta^j,\ j = 1, \ldots, L - 1\}$ are separated as $|\theta^j - \theta^k| \ge 2\delta > \delta$ for all $j \ne k$, which implies that
$M(2\delta; [-1, 1], |\cdot|) \ge \lfloor \frac{1}{\delta} \rfloor$. Combined with Lemma 5.5 and our previous upper bound (5.2),
we conclude that $\log N(\delta; [-1, 1], |\cdot|) \asymp \log(1/\delta)$ for $\delta > 0$ sufficiently small. This argument
can be extended to the $d$-dimensional cube with the sup-norm $\|\cdot\|_\infty$, showing that

$$\log N(\delta; [0, 1]^d, \|\cdot\|_\infty) \asymp d \log(1/\delta) \quad \text{for } \delta > 0 \text{ sufficiently small.} \tag{5.5}$$

Thus, we see how an explicit construction of a packing set can be used to lower bound the
metric entropy. ♣


In Exercise 5.3, we show how a packing argument can be used to obtain a refined upper 
bound on the covering number of the Boolean hypercube from Example 5.3. 


We now seek some more general understanding of what geometric properties govern metric
entropy. Since covering is defined in terms of the number of balls, each with a fixed
radius and hence volume, one would expect to see connections between covering numbers
and volumes of these balls. The following lemma provides a precise statement of this
connection in the case of norms on $\mathbb{R}^d$ with open unit balls, for which the volume can be taken
with respect to Lebesgue measure. Important examples are the usual $\ell_q$-balls, defined for
$q \in [1, \infty]$ via

$$\mathbb{B}_q^d(1) := \{x \in \mathbb{R}^d \mid \|x\|_q \le 1\}, \tag{5.6}$$

where the $\ell_q$-norm is given by

$$\|x\|_q := \begin{cases} \left(\sum_{i=1}^{d} |x_i|^q\right)^{1/q} & \text{for } q \in [1, \infty), \\ \max_{i=1,\ldots,d} |x_i| & \text{for } q = \infty. \end{cases} \tag{5.7}$$

The following lemma relates the metric entropy to the so-called volume ratio. It involves the
Minkowski sum $A + B := \{a + b \mid a \in A,\ b \in B\}$ of two sets.


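Lemma 5.7 (Volume ratios and metric entropy) Consider a pair of norms $\|\cdot\|$ and $\|\cdot\|'$
on $\mathbb{R}^d$, and let $\mathbb{B}$ and $\mathbb{B}'$ be their corresponding unit balls (i.e., $\mathbb{B} = \{\theta \in \mathbb{R}^d \mid \|\theta\| \le 1\}$,
with $\mathbb{B}'$ similarly defined). Then the $\delta$-covering number of $\mathbb{B}$ in the $\|\cdot\|'$-norm obeys the
bounds

$$\left(\frac{1}{\delta}\right)^d \frac{\operatorname{vol}(\mathbb{B})}{\operatorname{vol}(\mathbb{B}')} \overset{(a)}{\le} N(\delta; \mathbb{B}, \|\cdot\|') \overset{(b)}{\le} \frac{\operatorname{vol}(\frac{2}{\delta}\mathbb{B} + \mathbb{B}')}{\operatorname{vol}(\mathbb{B}')}. \tag{5.8}$$


Whenever $\mathbb{B}' \subseteq \mathbb{B}$, the upper bound (b) may be simplified by observing that

$$\operatorname{vol}\left(\frac{2}{\delta}\mathbb{B} + \mathbb{B}'\right) \le \operatorname{vol}\left(\left(\frac{2}{\delta} + 1\right)\mathbb{B}\right) = \left(\frac{2}{\delta} + 1\right)^d \operatorname{vol}(\mathbb{B}),$$

which implies that $N(\delta; \mathbb{B}, \|\cdot\|') \le \left(1 + \frac{2}{\delta}\right)^d \frac{\operatorname{vol}(\mathbb{B})}{\operatorname{vol}(\mathbb{B}')}$.

Proof On one hand, if $\{\theta^1, \ldots, \theta^N\}$ is a $\delta$-covering of $\mathbb{B}$, then we have

$$\mathbb{B} \subseteq \bigcup_{j=1}^{N} \{\theta^j + \delta \mathbb{B}'\},$$

which implies that $\operatorname{vol}(\mathbb{B}) \le N \operatorname{vol}(\delta \mathbb{B}') = N \delta^d \operatorname{vol}(\mathbb{B}')$, thus establishing inequality (a) in the
claim (5.8).

In order to establish inequality (b) in (5.8), let $\{\theta^1, \ldots, \theta^M\}$ be a maximal $\delta$-packing
of $\mathbb{B}$ in the $\|\cdot\|'$-norm; by maximality, this set must also be a $\delta$-covering of $\mathbb{B}$ under the
$\|\cdot\|'$-norm, so that $N(\delta; \mathbb{B}, \|\cdot\|') \le M$. Since the points are more than $\delta$ apart, the balls
$\{\theta^j + \frac{\delta}{2}\mathbb{B}',\ j = 1, \ldots, M\}$ are all disjoint and contained within $\mathbb{B} + \frac{\delta}{2}\mathbb{B}'$.
Taking volumes, we conclude that $\sum_{j=1}^{M} \operatorname{vol}(\theta^j + \frac{\delta}{2}\mathbb{B}') \le \operatorname{vol}(\mathbb{B} + \frac{\delta}{2}\mathbb{B}')$, and hence

$$M \operatorname{vol}\left(\frac{\delta}{2}\mathbb{B}'\right) \le \operatorname{vol}\left(\mathbb{B} + \frac{\delta}{2}\mathbb{B}'\right).$$

Finally, we have $\operatorname{vol}(\frac{\delta}{2}\mathbb{B}') = (\frac{\delta}{2})^d \operatorname{vol}(\mathbb{B}')$ and $\operatorname{vol}(\mathbb{B} + \frac{\delta}{2}\mathbb{B}') = (\frac{\delta}{2})^d \operatorname{vol}(\frac{2}{\delta}\mathbb{B} + \mathbb{B}')$, from which
the claim (b) in equation (5.8) follows.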


Let us illustrate Lemma 5.7 with an example. 


Example 5.8 (Covering unit balls in their own metrics) As an important special case, if we
take $\mathbb{B} = \mathbb{B}'$ in Lemma 5.7, then we obtain upper and lower bounds on the metric entropy of
a given unit ball in terms of its own norm, namely

$$d \log(1/\delta) \le \log N(\delta; \mathbb{B}, \|\cdot\|) \le d \log\left(1 + \frac{2}{\delta}\right). \tag{5.9}$$

When applied to the $\ell_\infty$-norm, this result shows that the $\|\cdot\|_\infty$-metric entropy of $\mathbb{B}_\infty^d = [-1, 1]^d$
scales as $d \log(1/\delta)$, so that we immediately recover the end result of our more direct analysis
in Examples 5.2 and 5.6. As another special case, we also find that the Euclidean unit ball
$\mathbb{B}_2^d$ can be covered by at most $(1 + 2/\delta)^d$ balls with radius $\delta$ in the norm $\|\cdot\|_2$. In Example 5.12
to follow in the sequel, we use Lemma 5.7 to bound the metric entropy of certain ellipsoids
in $\ell^2(\mathbb{N})$. ♣


Thus far, we have studied the metric entropy of various subsets of R“. We now turn to 
the metric entropy of some function classes, beginning with a simple parametric class of 
functions. 


Example 5.9 (A parametric class of functions) For any fixed $\theta$, define the real-valued
function $f_\theta(x) := 1 - e^{-\theta x}$, and consider the function class

\[ \mathscr{P} := \{ f_\theta : [0, 1] \to \mathbb{R} \mid \theta \in [0, 1] \}. \]

The set $\mathscr{P}$ is a metric space under the uniform norm (also known as the sup-norm) given
by $\|f - g\|_\infty := \sup_{x \in [0,1]} |f(x) - g(x)|$. We claim that its covering number in terms of the
sup-norm is bounded above and below as

\[ 1 + \Big\lfloor \frac{1 - 1/e}{2\delta} \Big\rfloor \overset{(i)}{\le} N_\infty(\delta; \mathscr{P}) \overset{(ii)}{\le} \frac{1}{2\delta} + 2. \tag{5.10} \]

We first establish the upper bound given in inequality (ii) of (5.10). For a given $\delta \in (0, 1)$,
let us set $T = \lfloor \frac{1}{2\delta} \rfloor$, and define $\theta^i := 2\delta i$ for $i = 0, 1, \ldots, T$. By also adding the point $\theta^{T+1} = 1$,
we obtain a collection of points $\{\theta^0, \ldots, \theta^T, \theta^{T+1}\}$ contained within $[0, 1]$. We claim that the
associated functions $\{f_{\theta^0}, \ldots, f_{\theta^{T+1}}\}$ form a $\delta$-cover for $\mathscr{P}$. Indeed, for any $f_\theta \in \mathscr{P}$, we can
find some $\theta^i$ in our cover such that $|\theta^i - \theta| \le \delta$. We then have

\[ \|f_{\theta^i} - f_\theta\|_\infty = \max_{x \in [0,1]} |e^{-\theta^i x} - e^{-\theta x}| \le |\theta^i - \theta| \le \delta, \]

which implies that $N_\infty(\delta; \mathscr{P}) \le T + 2 \le \frac{1}{2\delta} + 2$.

In order to prove the lower bound on the covering number, as stated in inequality (i) in
(5.10), we proceed by first lower bounding the packing number, and then applying Lemma 5.5.
An explicit packing can be constructed as follows: first set $\theta^0 = 0$, and then define $\theta^i =
-\log(1 - \delta i)$ for all $i$ such that $\theta^i \le 1$. We can define $\theta^i$ in this way until $1/e = 1 - T\delta$,
or $T \ge \lfloor \frac{1 - 1/e}{\delta} \rfloor$. Moreover, note that for any $i \ne j$ in the resulting set of functions, we
have $\|f_{\theta^i} - f_{\theta^j}\|_\infty \ge |f_{\theta^i}(1) - f_{\theta^j}(1)| \ge \delta$, by definition of $\theta^i$. Therefore, we conclude that
$M_\infty(\delta; \mathscr{P}) \ge \lfloor \frac{1 - 1/e}{\delta} \rfloor + 1$, and hence that

\[ N_\infty(\delta; \mathscr{P}) \ge M_\infty(2\delta; \mathscr{P}) \ge \Big\lfloor \frac{1 - 1/e}{2\delta} \Big\rfloor + 1, \]

as claimed. We have thus established the scaling $\log N(\delta; \mathscr{P}, \|\cdot\|_\infty) \asymp \log(1/\delta)$ as $\delta \to 0^+$.
This rate is the typical one to be expected for a scalar parametric class. ♣
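
As a numerical sanity check on the upper bound (ii) in (5.10), the following Python sketch builds the grid $\theta^i = 2\delta i$ together with the extra endpoint $1$, and measures how far a test function $f_\theta$ can be from its nearest grid function. The helper names (`cover_points`, `sup_dist`, `worst_case_distance`) and the finite grids used to approximate the supremum are illustrative assumptions, not part of the original argument.

```python
import math

def f(theta, x):
    """Member of the parametric class: f_theta(x) = 1 - exp(-theta * x)."""
    return 1.0 - math.exp(-theta * x)

def cover_points(delta):
    """Grid theta^i = 2*delta*i for i = 0..T with T = floor(1/(2*delta)), plus the endpoint 1."""
    T = math.floor(1.0 / (2.0 * delta))
    return [2.0 * delta * i for i in range(T + 1)] + [1.0]

def sup_dist(theta_a, theta_b, grid_size=200):
    """Approximate sup-norm distance on [0, 1] using an evenly spaced x-grid (an assumption)."""
    xs = [k / grid_size for k in range(grid_size + 1)]
    return max(abs(f(theta_a, x) - f(theta_b, x)) for x in xs)

def worst_case_distance(delta, n_test=100):
    """Largest distance from a test function f_theta to its nearest cover element."""
    centers = cover_points(delta)
    worst = 0.0
    for k in range(n_test + 1):
        theta = k / n_test
        nearest = min(sup_dist(theta, c) for c in centers)
        worst = max(worst, nearest)
    return worst

delta = 0.07
centers = cover_points(delta)
# Number of cover elements versus the bound 1/(2*delta) + 2 from (5.10).
print(len(centers), 1.0 / (2.0 * delta) + 2.0)
# Worst approximation error over the test grid; it should not exceed delta.
print(worst_case_distance(delta), delta)
```

The check only samples finitely many values of $\theta$ and $x$, so it illustrates the covering property rather than proving it.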


A function class with a metric entropy that scales as $\log(1/\delta)$ as $\delta \to 0^+$ is relatively small.
Indeed, as shown in Example 5.2, the interval $[-1, 1]$ has metric entropy of this order, and
the function class $\mathscr{P}$ from Example 5.9 is not essentially different. Other function classes
are much richer, and so their metric entropy exhibits a correspondingly faster growth, as
shown by the following example.


Example 5.10 (Lipschitz functions on the unit interval) Now consider the class of Lipschitz
functions

\[ \mathcal{F}_L := \{ g : [0, 1] \to \mathbb{R} \mid g(0) = 0, \text{ and } |g(x) - g(x')| \le L |x - x'| \ \ \forall\, x, x' \in [0, 1] \}. \tag{5.11} \]

Here $L > 0$ is a fixed constant, and all of the functions in the class obey the Lipschitz bound,
uniformly over all of $[0, 1]$. Note that the function class $\mathscr{P}$ from Example 5.9 is contained
within the class $\mathcal{F}_L$ with $L = 1$. It is known that the metric entropy of the class $\mathcal{F}_L$ with
respect to the sup-norm scales as

\[ \log N_\infty(\delta; \mathcal{F}_L) \asymp (L/\delta) \quad \text{for suitably small } \delta > 0. \tag{5.12} \]

Consequently, the set of Lipschitz functions is a much larger class than the parametric function
class from Example 5.9, since its metric entropy grows as $1/\delta$ as $\delta \to 0$, as compared
to $\log(1/\delta)$.

Let us prove the lower bound in equation (5.12); via Lemma 5.5, it suffices to construct a
sufficiently large packing of the set $\mathcal{F}_L$. For a given $\epsilon > 0$, define $M = \lfloor 1/\epsilon \rfloor$, and consider
the points in $[0, 1]$ given by

\[ x_i = (i - 1)\epsilon, \quad \text{for } i = 1, \ldots, M, \quad \text{and} \quad x_{M+1} = M\epsilon \le 1. \]

Figure 5.2 The function class $\{f_\beta, \beta \in \{-1,+1\}^M\}$ used to construct a packing
of the Lipschitz class $\mathcal{F}_L$. Each function is piecewise linear over the intervals
$[0, \epsilon], [\epsilon, 2\epsilon], \ldots, [(M-1)\epsilon, M\epsilon]$ with slope either $+L$ or $-L$. There are $2^M$ functions
in total, where $M = \lfloor 1/\epsilon \rfloor$.

Moreover, define the function $\phi : \mathbb{R} \to \mathbb{R}$ via

\[ \phi(u) := \begin{cases} 0 & \text{for } u < 0, \\ u & \text{for } u \in [0, 1], \\ 1 & \text{otherwise.} \end{cases} \tag{5.13} \]

For each binary sequence $\beta \in \{-1, +1\}^M$, we may then define a function $f_\beta$ mapping the unit
interval $[0, 1]$ to $[-L, +L]$ via

\[ f_\beta(y) = \sum_{i=1}^M \beta_i \, L \, \epsilon \, \phi\Big( \frac{y - x_i}{\epsilon} \Big). \tag{5.14} \]

By construction, each function $f_\beta$ is piecewise linear and continuous, with slope either $+L$
or $-L$ over each of the intervals $[\epsilon(i-1), \epsilon i]$ for $i = 1, \ldots, M$, and constant on the remaining
interval $[M\epsilon, 1]$; see Figure 5.2 for an illustration. Moreover, it is straightforward to verify
that $f_\beta(0) = 0$ and that $f_\beta$ is Lipschitz with constant $L$, which ensures that $f_\beta \in \mathcal{F}_L$.

Given a pair of distinct binary strings $\beta \ne \beta'$ and the two functions $f_\beta$ and $f_{\beta'}$, there is at
least one interval where the functions start at the same point, and have the opposite slope
over an interval of length $\epsilon$. Since the functions have slopes $+L$ and $-L$, respectively, we
are guaranteed that $\|f_\beta - f_{\beta'}\|_\infty \ge 2L\epsilon$, showing that the set $\{f_\beta, \beta \in \{-1,+1\}^M\}$ forms a
$2L\epsilon$ packing in the sup-norm. Since this set has cardinality $2^M = 2^{\lfloor 1/\epsilon \rfloor}$, after making the
substitution $\epsilon = \delta/L$ and using Lemma 5.5, we conclude that

\[ \log N(\delta; \mathcal{F}_L, \|\cdot\|_\infty) \gtrsim L/\delta. \]
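
The packing construction can be checked directly for small $M$. The sketch below, with hypothetical helper names and the illustrative choices $L = 2$ and $\epsilon = 1/4$, enumerates all $2^M$ sign patterns, evaluates each $f_\beta$ from (5.14) at the knots $0, \epsilon, \ldots, M\epsilon$, and reports the smallest pairwise sup-norm distance. Evaluating at the knots suffices here because a difference of two such piecewise linear functions attains its maximum absolute value at a knot.

```python
import itertools

def phi(u):
    """Clipped ramp from (5.13): 0 below 0, identity on [0, 1], 1 above 1."""
    return min(max(u, 0.0), 1.0)

def f_beta(beta, y, L, eps):
    """Evaluate (5.14); enumerate starts at 0, so the knot x_{i+1} equals i * eps."""
    return sum(b * L * eps * phi((y - i * eps) / eps) for i, b in enumerate(beta))

def min_pairwise_sup_distance(L, eps):
    """Return (number of functions, smallest sup-norm distance between distinct f_beta)."""
    M = int(1.0 / eps)  # assumes 1/eps is computed without rounding trouble, e.g. eps = 0.25
    knots = [k * eps for k in range(M + 1)]
    values = {
        beta: [f_beta(beta, y, L, eps) for y in knots]
        for beta in itertools.product((-1, 1), repeat=M)
    }
    best = float("inf")
    for a, b in itertools.combinations(values, 2):
        dist = max(abs(u - v) for u, v in zip(values[a], values[b]))
        best = min(best, dist)
    return len(values), best

count, gap = min_pairwise_sup_distance(L=2.0, eps=0.25)
# Compare the number of functions with 2^M and the smallest gap with 2 * L * eps.
print(count, gap, 2 * 2.0 * 0.25)
```

The enumeration is exponential in $M$, so this is only meant for very small examples.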


With a little more effort, it can also be shown that the collection of functions
$\{f_\beta, \beta \in \{-1,+1\}^M\}$ defines a suitable covering of the set $\mathcal{F}_L$, which establishes the overall
claim (5.12). ♣

The preceding example can be extended to Lipschitz functions on the unit cube in higher
dimensions, meaning real-valued functions on $[0, 1]^d$ such that

\[ |f(x) - f(y)| \le L \|x - y\|_\infty \quad \text{for all } x, y \in [0, 1]^d, \tag{5.15} \]

a class that we denote by $\mathcal{F}_L([0, 1]^d)$. An extension of our argument can then be used to
show that

\[ \log N_\infty(\delta; \mathcal{F}_L([0, 1]^d)) \asymp (L/\delta)^d. \]

It is worth contrasting the exponential dependence of this metric entropy on the dimension
$d$, as opposed to the linear dependence that we saw earlier for simpler sets (e.g.,
$d$-dimensional unit balls). This is a dramatic manifestation of the curse of dimensionality.


Another direction in which Example 5.10 can be extended is to classes of functions that have 
higher-order derivatives. 


Example 5.11 (Higher-order smoothness classes) We now consider an example of a function
class based on controlling higher-order derivatives. For a suitably differentiable function
$f$, let us adopt the notation $f^{(k)}$ to mean the $k$th derivative. (Of course, $f^{(0)} = f$ in
this notation.) For some integer $\alpha$ and parameter $\gamma \in (0, 1]$, consider the class of functions
$f : [0, 1] \to \mathbb{R}$ such that

\[ |f^{(j)}(x)| \le C_j \quad \text{for all } x \in [0, 1],\ j = 0, 1, \ldots, \alpha, \tag{5.16a} \]
\[ |f^{(\alpha)}(x) - f^{(\alpha)}(x')| \le L |x - x'|^{\gamma} \quad \text{for all } x, x' \in [0, 1]. \tag{5.16b} \]

We claim that the metric entropy of this function class $\mathcal{F}_{\alpha,\gamma}$ scales as

\[ \log N(\delta; \mathcal{F}_{\alpha,\gamma}, \|\cdot\|_\infty) \asymp \Big( \frac{1}{\delta} \Big)^{\frac{1}{\alpha + \gamma}}. \tag{5.17} \]

(Here we have absorbed the dependence on the constants $C_j$ and $L$ into the order notation.)
Note that this claim is consistent with our calculation in Example 5.10, which concerns
essentially the same class as $\mathcal{F}_{0,1}$.

Let us prove the lower bound in the claim (5.17). As in the previous example, we do so
by constructing a packing $\{f_\beta, \beta \in \{-1,+1\}^M\}$ for a suitably chosen integer $M$. Define the
function

\[ \phi(y) := \begin{cases} c \, 2^{2(\alpha+\gamma)} \, y^{\alpha+\gamma} (1 - y)^{\alpha+\gamma} & \text{for } y \in [0, 1], \\ 0 & \text{otherwise.} \end{cases} \tag{5.18} \]

If the pre-factor $c$ is chosen small enough (as a function of the constants $C_j$ and $L$), it can
be seen that the function $\phi$ satisfies the conditions (5.16). Now for some $\epsilon > 0$, let us set
$\delta = (\epsilon/c)^{1/(\alpha+\gamma)}$. By adjusting $c$ as needed, this can be done such that $M := \lfloor 1/\delta \rfloor < 1/\delta$, so
that we consider the points in $[0, 1]$ given by

\[ x_i = (i - 1)\delta, \quad \text{for } i = 1, \ldots, M, \quad \text{and} \quad x_{M+1} = M\delta < 1. \]




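For each $\beta \in \{-1, +1\}^M$, let us define the function

\[ f_\beta(x) := \sum_{i=1}^M \beta_i \, \delta^{\alpha+\gamma} \, \phi\Big( \frac{x - x_i}{\delta} \Big), \tag{5.19} \]

and note that it also satisfies the conditions (5.16). Finally, for two binary strings $\beta \ne \beta'$,
there must exist some $i \in \{1, \ldots, M\}$ and an associated interval $I_i = [x_i, x_{i+1}]$ such that

\[ |f_\beta(x) - f_{\beta'}(x)| = 2 \, \delta^{\alpha+\gamma} \, \phi\Big( \frac{x - x_i}{\delta} \Big) \quad \text{for all } x \in I_i. \]

By setting $x = x_i + \delta/2$ and noting that $\phi(1/2) = c$, we see that

\[ \|f_\beta - f_{\beta'}\|_\infty \ge 2c \, \delta^{\alpha+\gamma} = 2\epsilon, \]

so that the set $\{f_\beta, \beta \in \{-1, +1\}^M\}$ is a $2\epsilon$-packing. Thus, we conclude that

\[ \log N(\epsilon; \mathcal{F}_{\alpha,\gamma}, \|\cdot\|_\infty) \gtrsim (1/\delta) \asymp (1/\epsilon)^{1/(\alpha+\gamma)}, \]

as claimed. ♣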


Various types of function classes can be defined in terms of orthogonal expansions. Concretely,
suppose that we are given a sequence of functions $(\phi_j)_{j=1}^\infty$ belonging to $L^2[0, 1]$ and
such that

\[ \langle \phi_i, \phi_j \rangle_{L^2[0,1]} := \int_0^1 \phi_i(x) \, \phi_j(x) \, dx = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \]

For instance, the cosine basis is one such orthonormal basis, and there are many other
interesting ones. Given such a basis, any function $f \in L^2[0, 1]$ can be expanded in the
form $f = \sum_{j=1}^\infty \theta_j \phi_j$, where the expansion coefficients are given by the inner products
$\theta_j = \langle f, \phi_j \rangle$. By Parseval's theorem, we have $\|f\|_2^2 = \sum_{j=1}^\infty \theta_j^2$, so that $\|f\|_2 < \infty$ if and only if
$(\theta_j)_{j=1}^\infty \in \ell^2(\mathbb{N})$, the space of all square summable sequences. Various interesting classes of
functions can be obtained by imposing additional constraints on the class of sequences, and
one example is that of an ellipsoid constraint.


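Example 5.12 (Function classes based on ellipsoids in $\ell^2(\mathbb{N})$) Given a sequence of non-negative
real numbers $(\mu_j)_{j=1}^\infty$ such that $\sum_{j=1}^\infty \mu_j < \infty$, consider the ellipsoid

\[ \mathcal{E} = \Big\{ (\theta_j)_{j=1}^\infty \ \Big|\ \sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j} \le 1 \Big\} \subset \ell^2(\mathbb{N}). \tag{5.20} \]

Such ellipsoids play an important role in our discussion of reproducing kernel Hilbert spaces
(see Chapter 12). In this example, we study the ellipsoid specified by the sequence $\mu_j = j^{-2\alpha}$
for some parameter $\alpha > 1/2$. Ellipsoids of this type arise from certain classes of $\alpha$-times-differentiable
functions; see Chapter 12 for details.

We claim that the metric entropy of the associated ellipsoid with respect to the norm
$\|\cdot\|_2 = \|\cdot\|_{\ell^2(\mathbb{N})}$ scales as

\[ \log N(\delta; \mathcal{E}, \|\cdot\|_2) \asymp \Big( \frac{1}{\delta} \Big)^{1/\alpha} \quad \text{for all suitably small } \delta > 0. \tag{5.21} \]

We begin by proving the upper bound. In particular, for a given $\delta > 0$, let us upper bound$^2$
the covering number $N(\sqrt{2}\delta)$. Let $d$ be the smallest integer such that $\mu_d \le \delta^2$, and consider
the truncated ellipsoid

\[ \tilde{\mathcal{E}} := \{ \theta \in \mathcal{E} \mid \theta_j = 0 \text{ for all } j \ge d + 1 \}. \]

We claim that any $\delta$-cover of this truncated ellipsoid, say $\{\theta^1, \ldots, \theta^N\}$, forms a $\sqrt{2}\delta$-cover of
the full ellipsoid. Indeed, for any $\theta \in \mathcal{E}$, we have

\[ \sum_{j=d+1}^\infty \theta_j^2 \le \mu_d \sum_{j=d+1}^\infty \frac{\theta_j^2}{\mu_j} \le \delta^2, \]

and hence

\[ \min_{k \in [N]} \|\theta - \theta^k\|_2^2 = \min_{k \in [N]} \sum_{j=1}^d (\theta_j - \theta_j^k)^2 + \sum_{j=d+1}^\infty \theta_j^2 \le 2\delta^2. \]

Consequently, it suffices to upper bound the cardinality $N$ of this covering of $\tilde{\mathcal{E}}$. Since
$\delta \le \sqrt{\mu_j}$ for all $j \in \{1, \ldots, d\}$, if we view $\tilde{\mathcal{E}}$ as a subset of $\mathbb{R}^d$, then it contains the ball $\mathbb{B}_2^d(\delta)$,
and hence $\mathrm{vol}(\tilde{\mathcal{E}} + \mathbb{B}_2^d(\delta/2)) \le \mathrm{vol}(2\tilde{\mathcal{E}})$. Consequently, by Lemma 5.7, we have

\[ N \le \Big( \frac{2}{\delta} \Big)^d \frac{\mathrm{vol}(\tilde{\mathcal{E}} + \mathbb{B}_2^d(\delta/2))}{\mathrm{vol}(\mathbb{B}_2^d(1))} \le \Big( \frac{4}{\delta} \Big)^d \frac{\mathrm{vol}(\tilde{\mathcal{E}})}{\mathrm{vol}(\mathbb{B}_2^d(1))}. \]

By standard formulae for the volume of ellipsoids, we have $\frac{\mathrm{vol}(\tilde{\mathcal{E}})}{\mathrm{vol}(\mathbb{B}_2^d(1))} = \prod_{j=1}^d \sqrt{\mu_j}$. Putting
together the pieces, we find that

\[ \log N \le d \log(4/\delta) + \frac{1}{2} \sum_{j=1}^d \log \mu_j \overset{(i)}{=} d \log(4/\delta) - \alpha \sum_{j=1}^d \log j, \]

where step (i) follows from the substitution $\mu_j = j^{-2\alpha}$. Using the elementary inequality
$\sum_{j=1}^d \log j \ge d \log d - d$, we have

\[ \log N \le d(\log 4 + \alpha) + d\{\log(1/\delta) - \alpha \log d\} \le d(\log 4 + \alpha), \]

where the final inequality follows since $\mu_d = d^{-2\alpha} \le \delta^2$, which is equivalent to $\log(\frac{1}{\delta}) \le
\alpha \log d$. Since $(d - 1)^{-2\alpha} > \delta^2$, we have $d \le (1/\delta)^{1/\alpha} + 1$, and hence

\[ \log N \le \Big\{ \Big( \frac{1}{\delta} \Big)^{1/\alpha} + 1 \Big\} (\log 4 + \alpha), \]

which completes the proof of the upper bound.

For the lower bound, we note that the ellipsoid $\mathcal{E}$ contains the truncated ellipsoid $\tilde{\mathcal{E}}$, which
(when viewed as a subset of $\mathbb{R}^d$) contains the ball $\mathbb{B}_2^d(\delta)$. Thus, we have

\[ \log N\Big( \frac{\delta}{2}; \mathcal{E}, \|\cdot\|_2 \Big) \ge \log N\Big( \frac{\delta}{2}; \mathbb{B}_2^d(\delta), \|\cdot\|_2 \Big) \ge d \log 2, \]

where the final inequality uses the lower bound (5.9) from Example 5.8. Given the inequality
$d \ge (1/\delta)^{1/\alpha}$, we have established the lower bound in our original claim (5.21). ♣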
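
To see the two sides of the argument numerically, the following Python sketch computes the truncation level $d$ (the smallest integer with $d^{-2\alpha} \le \delta^2$) together with the lower bound $d \log 2$ and the upper bound $\{(1/\delta)^{1/\alpha} + 1\}(\log 4 + \alpha)$ derived above. The function name and the sample values of $\delta$ and $\alpha$ are illustrative assumptions. Note that the two bounds refer to slightly different scales ($\delta/2$ and $\sqrt{2}\delta$, respectively), which does not affect the scaling in (5.21).

```python
import math

def ellipsoid_entropy_bounds(delta, alpha):
    """Return (d, lower, upper) for the ellipsoid with mu_j = j^(-2*alpha).

    d     : smallest integer with d^(-2*alpha) <= delta^2
    lower : d * log 2, a lower bound on log N(delta/2; E)
    upper : ((1/delta)^(1/alpha) + 1) * (log 4 + alpha), an upper bound on log N(sqrt(2)*delta; E)
    """
    target = (1.0 / delta) ** (1.0 / alpha)
    d = math.ceil(target)
    lower = d * math.log(2.0)
    upper = (target + 1.0) * (math.log(4.0) + alpha)
    return d, lower, upper

alpha = 1.5
for delta in (0.3, 0.03, 0.003):
    d, lower, upper = ellipsoid_entropy_bounds(delta, alpha)
    scale = (1.0 / delta) ** (1.0 / alpha)
    # Both ratios stay bounded as delta shrinks, consistent with the (1/delta)^(1/alpha) scaling.
    print(delta, d, lower / scale, upper / scale)
```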


2 The additional factor of $\sqrt{2}$ is irrelevant for the purposes of establishing the claimed scaling (5.21).


5.2 Gaussian and Rademacher complexity


Although metric entropy is a purely deterministic concept, it plays a fundamental role in
understanding the behavior of stochastic processes. Given a collection of random variables
$\{X_\theta, \theta \in T\}$ indexed by $T$, it is frequently of interest to analyze how the behavior of this
stochastic process depends on the structure of the set $T$. In the other direction, given knowledge
of a stochastic process indexed by $T$, it is often possible to infer certain properties of
the set $T$. In our treatment to follow, we will see instances of both directions of this interplay.


An important example of this interplay is provided by the stochastic processes that define
the Gaussian and Rademacher complexities. Given a set $T \subseteq \mathbb{R}^d$, the family of random
variables $\{G_\theta, \theta \in T\}$, where

\[ G_\theta := \langle w, \theta \rangle = \sum_{i=1}^d w_i \theta_i, \quad \text{with } w_i \sim N(0, 1), \text{ i.i.d.}, \tag{5.22} \]

defines a stochastic process known as the canonical Gaussian process associated with $T$.
As discussed earlier in Chapter 2, its expected supremum

\[ \mathcal{G}(T) := \mathbb{E}\Big[ \sup_{\theta \in T} \langle \theta, w \rangle \Big] \tag{5.23} \]

is known as the Gaussian complexity of $T$. Like the metric entropy, the functional $\mathcal{G}(T)$
measures the size of the set $T$ in a certain sense. Replacing the standard Gaussian variables
with random signs yields the Rademacher process $\{R_\theta, \theta \in T\}$, where

\[ R_\theta := \langle \varepsilon, \theta \rangle = \sum_{i=1}^d \varepsilon_i \theta_i, \quad \text{with } \varepsilon_i \text{ uniform over } \{-1, +1\}, \text{ i.i.d.} \tag{5.24} \]

Its expected supremum $\mathcal{R}(T) := \mathbb{E}[\sup_{\theta \in T} \langle \theta, \varepsilon \rangle]$ is known as the Rademacher complexity of $T$. As
shown in Exercise 5.5, we have $\mathcal{R}(T) \le \sqrt{\pi/2} \; \mathcal{G}(T)$ for any set $T$, but there are sets for which
the Gaussian complexity is substantially larger than the Rademacher complexity.


Example 5.13 (Rademacher/Gaussian complexity of Euclidean ball $\mathbb{B}_2^d$) Let us compute
the Rademacher and Gaussian complexities of the Euclidean ball of unit norm, that is,
$\mathbb{B}_2^d = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le 1\}$. Computing the Rademacher complexity is straightforward:
indeed, the Cauchy-Schwarz inequality implies that

\[ \mathcal{R}(\mathbb{B}_2^d) = \mathbb{E}\Big[ \sup_{\|\theta\|_2 \le 1} \langle \theta, \varepsilon \rangle \Big] = \mathbb{E}\Big[ \Big( \sum_{i=1}^d \varepsilon_i^2 \Big)^{1/2} \Big] = \sqrt{d}. \]

The same argument shows that $\mathcal{G}(\mathbb{B}_2^d) = \mathbb{E}[\|w\|_2]$, and by concavity of the square-root function
and Jensen's inequality, we have

\[ \mathbb{E}\|w\|_2 \le \sqrt{\mathbb{E}[\|w\|_2^2]} = \sqrt{d}, \]

so that we have the upper bound $\mathcal{G}(\mathbb{B}_2^d) \le \sqrt{d}$. On the other hand, it can be shown that
$\mathbb{E}\|w\|_2 \ge \sqrt{d}\,(1 - o(1))$. This is a good exercise to work through, using concentration bounds
for $\chi^2$ variates from Chapter 2. Combining these upper and lower bounds, we conclude that

\[ \mathcal{G}(\mathbb{B}_2^d)/\sqrt{d} = 1 - o(1), \tag{5.25} \]

so that the Rademacher and Gaussian complexities of $\mathbb{B}_2^d$ are essentially equivalent. ♣
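
The claim (5.25) is easy to probe by simulation. The sketch below is a Monte Carlo illustration using only the Python standard library; the function names, sample sizes and seed are arbitrary choices rather than anything prescribed by the text. It estimates $\mathbb{E}\|w\|_2$ for Gaussian $w$ and evaluates $\mathbb{E}\|\varepsilon\|_2$ for random signs, then compares both with $\sqrt{d}$.

```python
import math
import random

def gaussian_complexity_l2_ball(d, trials, rng):
    """Monte Carlo estimate of E||w||_2 for w ~ N(0, I_d), which equals G(B_2^d)."""
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
    return total / trials

def rademacher_complexity_l2_ball(d, trials, rng):
    """Average of ||eps||_2 over random sign vectors; every draw has norm exactly sqrt(d)."""
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(rng.choice((-1.0, 1.0)) ** 2 for _ in range(d)))
    return total / trials

rng = random.Random(0)
for d in (10, 100, 400):
    g = gaussian_complexity_l2_ball(d, 500, rng)
    r = rademacher_complexity_l2_ball(d, 5, rng)
    # Ratios to sqrt(d): the Gaussian ratio sits slightly below 1, the Rademacher ratio equals 1.
    print(d, g / math.sqrt(d), r / math.sqrt(d))
```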


Example 5.14 (Rademacher/Gaussian complexity of $\mathbb{B}_1^d$) As a second example, let us consider
the $\ell_1$-ball in $d$ dimensions, denoted by $\mathbb{B}_1^d$. By the duality between the $\ell_1$- and $\ell_\infty$-norms
(or equivalently, using Hölder's inequality), we have

\[ \mathcal{R}(\mathbb{B}_1^d) = \mathbb{E}\Big[ \sup_{\|\theta\|_1 \le 1} \langle \theta, \varepsilon \rangle \Big] = \mathbb{E}[\|\varepsilon\|_\infty] = 1. \]

Similarly, we have $\mathcal{G}(\mathbb{B}_1^d) = \mathbb{E}[\|w\|_\infty]$, and using the result of Exercise 2.11 on Gaussian
maxima, we conclude that

\[ \mathcal{G}(\mathbb{B}_1^d)/\sqrt{2 \log d} = 1 \pm o(1). \tag{5.26} \]

Thus, we see that the Rademacher and Gaussian complexities can differ by a factor of the
order $\sqrt{\log d}$; as shown in Exercise 5.5, this difference turns out to be the worst possible.
But in either case, comparing with the Rademacher/Gaussian complexity of the Euclidean
ball (5.25) shows that the $\ell_1$-ball is a much smaller set. ♣
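
The $\sqrt{2 \log d}$ growth in (5.26) can also be observed by simulation. The following sketch (standard-library Python, with an illustrative function name, trial count and seed) estimates $\mathbb{E}[\|w\|_\infty]$ and reports its ratio to $\sqrt{2 \log d}$. For moderate $d$ the ratio is visibly different from $1$, since the $o(1)$ term decays slowly.

```python
import math
import random

def gaussian_complexity_l1_ball(d, trials, rng):
    """Monte Carlo estimate of E||w||_inf for w ~ N(0, I_d), which equals G(B_1^d)."""
    total = 0.0
    for _ in range(trials):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(d))
    return total / trials

rng = random.Random(0)
for d in (10, 100, 1000):
    est = gaussian_complexity_l1_ball(d, 300, rng)
    # The Rademacher complexity of the same ball is exactly 1, whatever the dimension.
    print(d, est, est / math.sqrt(2.0 * math.log(d)))
```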


Example 5.15 (Gaussian complexity of $\ell_0$-balls) We now turn to the Gaussian complexity
of a set defined in a combinatorial manner. As we explore at more length in later chapters,
sparsity plays an important role in many classes of high-dimensional statistical models. The
$\ell_1$-norm, as discussed in Example 5.14, is a convex constraint used to enforce sparsity. A
more direct and combinatorial way$^3$ is by limiting the number $\|\theta\|_0 := \sum_{j=1}^d \mathbb{I}[\theta_j \ne 0]$ of
non-zero entries in $\theta$. For some integer $s \in \{1, 2, \ldots, d\}$, the $\ell_0$-“ball” of radius $s$ is given by

\[ \mathbb{B}_0^d(s) := \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s\}. \tag{5.27} \]

This set is non-convex, corresponding to the union of $\binom{d}{s}$ subspaces, one for each of the
possible $s$-sized subsets of $d$ coordinates. Since it contains these subspaces, it is also an
unbounded set, so that, in computing any type of complexity measure, it is natural to impose
an additional constraint. For instance, let us consider the Gaussian complexity of the set

\[ \mathbb{S}^d(s) := \mathbb{B}_0^d(s) \cap \mathbb{B}_2^d(1) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s, \text{ and } \|\theta\|_2 \le 1\}. \tag{5.28} \]

Exercise 5.7 leads the reader through the steps required to establish the upper bound

\[ \mathcal{G}(\mathbb{S}^d(s)) \lesssim \sqrt{s \log \frac{e d}{s}}, \tag{5.29} \]

where $e \approx 2.7183$ is defined as usual. Moreover, we show in Exercise 5.8 that this bound is
tight up to constant factors. ♣
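
For this set the supremum can be computed in closed form for each draw of $w$: maximizing $\langle \theta, w \rangle$ over $s$-sparse unit vectors amounts to keeping the $s$ largest-magnitude entries of $w$ and taking their Euclidean norm. The sketch below uses that observation to estimate the Gaussian complexity by Monte Carlo and compares it with the rate $\sqrt{s \log(ed/s)}$ from (5.29). The function names, dimensions and seed are illustrative assumptions, and since (5.29) holds only up to a constant factor, the printed ratio need not be below $1$.

```python
import math
import random

def sparse_ball_complexity(d, s, trials, rng):
    """Monte Carlo estimate of the Gaussian complexity of {||theta||_0 <= s, ||theta||_2 <= 1}."""
    total = 0.0
    for _ in range(trials):
        w_sq = sorted((rng.gauss(0.0, 1.0) ** 2 for _ in range(d)), reverse=True)
        # The supremum equals the l2-norm of the s largest-magnitude coordinates of w.
        total += math.sqrt(sum(w_sq[:s]))
    return total / trials

def reference_rate(d, s):
    """The rate sqrt(s * log(e * d / s)) appearing in (5.29)."""
    return math.sqrt(s * math.log(math.e * d / s))

rng = random.Random(0)
d = 200
for s in (1, 5, 20, 200):
    est = sparse_ball_complexity(d, s, 200, rng)
    print(s, est, reference_rate(d, s), est / reference_rate(d, s))
```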


3 Despite our notation, the $\ell_0$-“norm” is not actually a norm in the usual sense of the word.


The preceding examples focused on subsets of vectors in $\mathbb{R}^d$. Gaussian complexity also
plays an important role in measuring the size of different classes of functions. For a given
class $\mathcal{F}$ of real-valued functions with domain $\mathcal{X}$, let $x_1^n := \{x_1, \ldots, x_n\}$ be a collection of $n$
points within the domain, known as the design points. We can then define the set

\[ \mathcal{F}(x_1^n) := \{ (f(x_1), f(x_2), \ldots, f(x_n)) \mid f \in \mathcal{F} \} \subseteq \mathbb{R}^n. \tag{5.30} \]

Bounding the Gaussian complexity of this subset of $\mathbb{R}^n$ yields a measure of the complexity
of $\mathcal{F}$ at scale $n$; this measure plays an important role in our analysis of nonparametric least
squares in Chapter 13.


It is most natural to analyze a version of the set $\mathcal{F}(x_1^n)$ that is rescaled, either by $n^{-1/2}$ or
by $n^{-1}$. It is useful to observe that the Euclidean metric on the rescaled set $\mathcal{F}(x_1^n)/\sqrt{n}$ corresponds
to the empirical $L^2(\mathbb{P}_n)$-metric on the function space $\mathcal{F}$, viz.

\[ \|f - g\|_n := \sqrt{ \frac{1}{n} \sum_{i=1}^n \big( f(x_i) - g(x_i) \big)^2 }. \tag{5.31} \]

Note that, if the function class $\mathcal{F}$ is uniformly bounded (i.e., $\|f\|_\infty \le b$ for all $f \in \mathcal{F}$), then
we also have $\|f\|_n \le b$ for all $f \in \mathcal{F}$. In this case, we always have the following (trivial)
upper bound

\[ \mathcal{G}(\mathcal{F}(x_1^n)/n) = \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^n \frac{w_i}{\sqrt{n}} \frac{f(x_i)}{\sqrt{n}} \Big] \le b \, \frac{\mathbb{E}[\|w\|_2]}{\sqrt{n}} \le b, \]

where we have recalled our analysis of $\mathbb{E}[\|w\|_2]$ from Example 5.13. Thus, a bounded function
class (evaluated at $n$ points) has Gaussian complexity that is never larger than a (scaled)
Euclidean ball in $\mathbb{R}^n$. A more refined analysis will show that the Gaussian complexity of
$\mathcal{F}(x_1^n)/n$ is often substantially smaller, depending on the structure of $\mathcal{F}$. We will study many
instances of such refined bounds in the sequel.


5.3 Metric entropy and sub-Gaussian processes 


Both the canonical Gaussian process (5.22) and the Rademacher process (5.24) are particu- 
lar examples of sub-Gaussian processes, which we now define in more generality. 


Definition 5.16 A collection of zero-mean random variables $\{X_\theta, \theta \in T\}$ is a sub-Gaussian
process with respect to a metric $\rho_X$ on $T$ if

\[ \mathbb{E}\big[ e^{\lambda (X_\theta - X_{\tilde\theta})} \big] \le e^{\frac{\lambda^2 \rho_X^2(\theta, \tilde\theta)}{2}} \quad \text{for all } \theta, \tilde\theta \in T, \text{ and } \lambda \in \mathbb{R}. \tag{5.32} \]

By the results of Chapter 2, the bound (5.32) implies the tail bound

\[ \mathbb{P}\big[ |X_\theta - X_{\tilde\theta}| \ge t \big] \le 2 e^{-\frac{t^2}{2 \rho_X^2(\theta, \tilde\theta)}}, \]

and imposing such a tail bound is an equivalent way in which to define a sub-Gaussian
process. It is easy to see that the canonical Gaussian and Rademacher processes are both
sub-Gaussian with respect to the Euclidean metric $\|\theta - \tilde\theta\|_2$.

Given a sub-Gaussian process, we use the notation $N_X(\delta; T)$ to denote the $\delta$-covering
number of $T$ with respect to $\rho_X$, and $N_2(\delta; T)$ to denote the covering number with respect
to the Euclidean metric $\|\cdot\|_2$, corresponding to the case of a canonical Gaussian process. As
we now discuss, these metric entropies can be used to construct upper bounds on various
expected suprema involving the process.


5.3.1 Upper bound by one-step discretization 


We begin with a simple upper bound obtained via a discretization argument. The basic idea
is natural: by approximating the set $T$ up to some accuracy $\delta$, we may replace the supremum
over $T$ by a finite maximum over the $\delta$-covering set, plus an approximation error that
scales proportionally with $\delta$. We let $D := \sup_{\theta, \tilde\theta \in T} \rho_X(\theta, \tilde\theta)$ denote the diameter of $T$, and let
$N_X(\delta; T)$ denote the $\delta$-covering number of $T$ in the $\rho_X$-metric.


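Proposition 5.17 (One-step discretization bound) Let $\{X_\theta, \theta \in T\}$ be a zero-mean
sub-Gaussian process with respect to the metric $\rho_X$. Then for any $\delta \in [0, D]$ such that
$N_X(\delta; T) \ge 10$, we have

\[ \mathbb{E}\Big[ \sup_{\theta, \tilde\theta \in T} (X_\theta - X_{\tilde\theta}) \Big] \le 2 \, \mathbb{E}\Big[ \sup_{\substack{\gamma, \gamma' \in T \\ \rho_X(\gamma, \gamma') \le \delta}} (X_\gamma - X_{\gamma'}) \Big] + 4 \sqrt{D^2 \log N_X(\delta; T)}. \tag{5.33} \]


Remarks: It is convenient to state the upper bound in terms of the increments $X_\theta - X_{\tilde\theta}$ so as
to avoid issues of considering where the set $T$ is centered. However, the claim (5.33) always
implies an upper bound on $\mathbb{E}[\sup_{\theta \in T} X_\theta]$, since the zero-mean condition means that, for
any fixed $\theta_0 \in T$,

\[ \mathbb{E}\Big[ \sup_{\theta \in T} X_\theta \Big] = \mathbb{E}\Big[ \sup_{\theta \in T} (X_\theta - X_{\theta_0}) \Big] \le \mathbb{E}\Big[ \sup_{\theta, \tilde\theta \in T} (X_\theta - X_{\tilde\theta}) \Big]. \]

For each $\delta \in [0, D]$, the upper bound (5.33) consists of two quantities, corresponding to
approximation error and estimation error, respectively. As $\delta \to 0^+$, the approximation error
(involving the constraint $\rho_X(\gamma, \gamma') \le \delta$) shrinks to zero, whereas the estimation error (involving
the metric entropy) grows. In practice, one chooses $\delta$ so as to achieve the optimal
trade-off between these two terms.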


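Proof For a given $\delta \ge 0$ and associated covering number $N = N_X(\delta; T)$, let $\{\theta^1, \ldots, \theta^N\}$
be a $\delta$-cover of $T$. For any $\theta \in T$, we can find some $\theta^i$ such that $\rho_X(\theta, \theta^i) \le \delta$, and hence

\[ X_\theta - X_{\theta^1} = (X_\theta - X_{\theta^i}) + (X_{\theta^i} - X_{\theta^1}) \le \sup_{\substack{\gamma, \gamma' \in T \\ \rho_X(\gamma, \gamma') \le \delta}} (X_\gamma - X_{\gamma'}) + \max_{i = 1, 2, \ldots, N} |X_{\theta^i} - X_{\theta^1}|. \]

Given some other arbitrary $\tilde\theta \in T$, the same upper bound holds for $X_{\theta^1} - X_{\tilde\theta}$, so that adding
together the bounds, we obtain

\[ \sup_{\theta, \tilde\theta \in T} (X_\theta - X_{\tilde\theta}) \le 2 \sup_{\substack{\gamma, \gamma' \in T \\ \rho_X(\gamma, \gamma') \le \delta}} (X_\gamma - X_{\gamma'}) + 2 \max_{i = 1, 2, \ldots, N} |X_{\theta^i} - X_{\theta^1}|. \tag{5.34} \]

Now by assumption, for each $i = 1, 2, \ldots, N$, the random variable $X_{\theta^i} - X_{\theta^1}$ is zero-mean
and sub-Gaussian with parameter at most $\rho_X(\theta^i, \theta^1) \le D$. Consequently, by the behavior of
sub-Gaussian maxima (see Exercise 2.12(c)), we are guaranteed that

\[ \mathbb{E}\Big[ \max_{i = 1, \ldots, N} |X_{\theta^i} - X_{\theta^1}| \Big] \le 2 \sqrt{D^2 \log N}, \]

which yields the claim.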


In order to gain intuition, it is worth considering the special case of the canonical Gaussian
(or Rademacher) process, in which case the relevant metric is the Euclidean norm $\|\theta - \tilde\theta\|_2$.
In order to reduce to the essential aspects of the problem, consider a set $T$ that contains the
origin. The arguments$^4$ leading to the bound (5.33) imply that the Gaussian complexity $\mathcal{G}(T)$
is upper bounded as

\[ \mathcal{G}(T) \le \min_{\delta \in [0, D]} \Big\{ \mathcal{G}(\tilde{T}(\delta)) + 2 \sqrt{D^2 \log N_2(\delta; T)} \Big\}, \tag{5.35} \]

where $N_2(\delta; T)$ is the $\delta$-covering number in the $\ell_2$-norm, and

\[ \tilde{T}(\delta) := \{ \gamma - \gamma' \mid \gamma, \gamma' \in T, \ \|\gamma - \gamma'\|_2 \le \delta \}. \]

The quantity $\mathcal{G}(\tilde{T}(\delta))$ is referred to as a localized Gaussian complexity, since it measures
the complexity of the set $T$ within an $\ell_2$-ball of radius $\delta$. This idea of localization plays an
important role in obtaining optimal rates for statistical problems; see Chapters 13 and 14
for further discussion. We note also that analogous upper bounds hold for the Rademacher
complexity $\mathcal{R}(T)$ in terms of a localized Rademacher complexity.


In order to obtain concrete results from the discretization bound (5.35), it remains to upper
bound the localized Gaussian complexity, and then optimize the choice of $\delta$. When $T$ is a
subset of $\mathbb{R}^d$, the Cauchy-Schwarz inequality yields

\[ \mathcal{G}(\tilde{T}(\delta)) = \mathbb{E}\Big[ \sup_{\theta \in \tilde{T}(\delta)} \langle \theta, w \rangle \Big] \le \delta \, \mathbb{E}[\|w\|_2] \le \delta \sqrt{d}, \]

which leads to the naive discretization bound

\[ \mathcal{G}(T) \le \min_{\delta \in [0, D]} \Big\{ \delta \sqrt{d} + 2 \sqrt{D^2 \log N_2(\delta; T)} \Big\}. \tag{5.36} \]

For some sets, this simple bound can yield useful results, whereas for other sets, the local
Gaussian (or Rademacher) complexity needs to be controlled with more care.


4 In this case, the argument can be refined so as to remove a factor of 2.


5.3.2 Some examples of discretization bounds


Let us illustrate the use of the bounds (5.33), (5.35) and (5.36) with some examples.


Example 5.18 (Gaussian complexity of unit ball) Recall our discussion of the Gaussian
complexity of the Euclidean ball $\mathbb{B}_2^d$ from Example 5.13: using direct methods, we proved
the scaling $\mathcal{G}(\mathbb{B}_2^d) = \sqrt{d}\,(1 - o(1))$. The purpose of this example is to show that Proposition 5.17
yields an upper bound with this type of scaling (albeit with poor control of the
pre-factor). In particular, recall from Example 5.8 that the metric entropy of the Euclidean
ball is upper bounded as $\log N_2(\delta; \mathbb{B}_2^d) \le d \log(1 + \frac{2}{\delta})$. Thus, setting $\delta = 1/2$ in the
naive discretization bound (5.36), we obtain

\[ \mathcal{G}(\mathbb{B}_2^d) \le \sqrt{d} \, \Big\{ \frac{1}{2} + 2 \sqrt{2 \log 5} \Big\}. \]

Relative to the exact result, the constant in this result is sub-optimal, but it does have the
correct scaling as a function of $d$. ♣


Example 5.19 (Maximum singular value of sub-Gaussian random matrix) As a more substantive
demonstration of Proposition 5.17, let us show how it can be used to control the
expected $\ell_2$-operator norm of a sub-Gaussian random matrix. Let $W \in \mathbb{R}^{n \times d}$ be a random
matrix with zero-mean i.i.d. entries $W_{ij}$, each sub-Gaussian with parameter $\sigma = 1$. Examples
include the standard Gaussian ensemble $W_{ij} \sim N(0, 1)$, and the Rademacher ensemble
$W_{ij} \in \{-1, +1\}$ equiprobably. The $\ell_2$-operator norm (or spectral norm) of the matrix $W$ is
given by its maximum singular value; equivalently, it is defined as $|\!|\!|W|\!|\!|_2 := \sup_{v \in \mathbb{S}^{d-1}} \|Wv\|_2$,
where $\mathbb{S}^{d-1} = \{v \in \mathbb{R}^d \mid \|v\|_2 = 1\}$ is the Euclidean unit sphere in $\mathbb{R}^d$. Here we sketch out an
approach for proving the bound

\[ \mathbb{E}\big[ |\!|\!|W|\!|\!|_2 / \sqrt{n} \big] \lesssim 1 + \sqrt{\frac{d}{n}}, \]

leaving certain details for the reader in Exercise 5.11.

Let us define the class of matrices

\[ \mathbb{M}^{n,d}(1) := \{ \Theta \in \mathbb{R}^{n \times d} \mid \mathrm{rank}(\Theta) = 1, \ |\!|\!|\Theta|\!|\!|_F = 1 \}, \tag{5.37} \]

corresponding to the set of $n \times d$ matrices of rank one with unit Frobenius norm $|\!|\!|\Theta|\!|\!|_F^2 =
\sum_{i=1}^n \sum_{j=1}^d \Theta_{ij}^2$. As verified in Exercise 5.11(a), we then have the variational representation

\[ |\!|\!|W|\!|\!|_2 = \sup_{\Theta \in \mathbb{M}^{n,d}(1)} X_\Theta, \quad \text{where } X_\Theta := \langle\!\langle W, \Theta \rangle\!\rangle = \sum_{i=1}^n \sum_{j=1}^d W_{ij} \Theta_{ij}. \tag{5.38} \]

In the Gaussian case, this representation shows that $\mathbb{E}[|\!|\!|W|\!|\!|_2]$ is equal to the Gaussian
complexity $\mathcal{G}(\mathbb{M}^{n,d}(1))$. For any sub-Gaussian random matrix, we show in part (b) of Exercise 5.11
that the stochastic process $\{X_\Theta, \Theta \in \mathbb{M}^{n,d}(1)\}$ is zero-mean, and sub-Gaussian
with respect to the Frobenius norm $|\!|\!|\Theta - \Theta'|\!|\!|_F$. Consequently, Proposition 5.17 implies that,
for all $\delta \in [0, 1]$, we have the upper bound

\[ \mathbb{E}[|\!|\!|W|\!|\!|_2] \le 2 \, \mathbb{E}\Big[ \sup_{\substack{\mathrm{rank}(\Gamma) = \mathrm{rank}(\Gamma') = 1 \\ |\!|\!|\Gamma - \Gamma'|\!|\!|_F \le \delta}} \langle\!\langle \Gamma - \Gamma', W \rangle\!\rangle \Big] + 6 \sqrt{\log N_F(\delta; \mathbb{M}^{n,d}(1))}, \tag{5.39} \]

where $N_F(\delta; \mathbb{M}^{n,d}(1))$ denotes the $\delta$-covering number in Frobenius norm. In part (c) of Exercise 5.11,
we prove the upper bound

\[ \mathbb{E}\Big[ \sup_{\substack{\mathrm{rank}(\Gamma) = \mathrm{rank}(\Gamma') = 1 \\ |\!|\!|\Gamma - \Gamma'|\!|\!|_F \le \delta}} \langle\!\langle \Gamma - \Gamma', W \rangle\!\rangle \Big] \le \sqrt{2} \, \delta \, \mathbb{E}[|\!|\!|W|\!|\!|_2], \tag{5.40} \]

and in part (d), we upper bound the metric entropy as

\[ \log N_F(\delta; \mathbb{M}^{n,d}(1)) \le (n + d) \log\Big( 1 + \frac{2}{\delta} \Big). \tag{5.41} \]

Substituting these upper bounds into inequality (5.39), we obtain

\[ \mathbb{E}[|\!|\!|W|\!|\!|_2] \le \min_{\delta \in [0, 1]} \Big\{ 2\sqrt{2} \, \delta \, \mathbb{E}[|\!|\!|W|\!|\!|_2] + 6 \sqrt{(n + d) \log\Big( 1 + \frac{2}{\delta} \Big)} \Big\}. \]

Fixing $\delta = \frac{1}{4\sqrt{2}}$ (as one particular choice) and rearranging terms yields the upper bound

\[ \frac{1}{\sqrt{n}} \, \mathbb{E}[|\!|\!|W|\!|\!|_2] \le c_1 \Big\{ 1 + \sqrt{\frac{d}{n}} \Big\} \]

for some universal constant $c_1 > 1$. Again, this yields the correct scaling of $\mathbb{E}[|\!|\!|W|\!|\!|_2]$ as a
function of $(n, d)$. As we explore in Exercise 5.14, for Gaussian random matrices, a more
refined argument using the Sudakov-Fernique comparison can be used to prove the upper
bound with $c_1 = 1$, which is the best possible. In Example 5.33 to follow, we establish a
matching lower bound of the same order. ♣
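
The scaling $\mathbb{E}[|\!|\!|W|\!|\!|_2]/\sqrt{n} \lesssim 1 + \sqrt{d/n}$ can be illustrated with a small simulation. The sketch below is pure standard-library Python and approximates the largest singular value by power iteration on $W^{\top}W$. This is a swapped-in numerical technique chosen to avoid external dependencies, not the method of the text, and the helper names, matrix sizes and iteration count are all illustrative assumptions. Because power iteration returns $\|Wv\|_2$ for a unit vector $v$, the reported value never exceeds the true operator norm.

```python
import math
import random

def operator_norm(W, iters=60):
    """Approximate the largest singular value of W (a list of rows) by power iteration."""
    n, d = len(W), len(W[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(d)) for i in range(n)]  # u = W v
        z = [sum(W[i][j] * u[i] for i in range(n)) for j in range(d)]  # z = W^T u
        norm_z = math.sqrt(sum(x * x for x in z))
        v = [x / norm_z for x in z]
    u = [sum(W[i][j] * v[j] for j in range(d)) for i in range(n)]
    return math.sqrt(sum(x * x for x in u))

def mean_scaled_norm(n, d, trials, rng, rademacher=False):
    """Average of |||W|||_2 / sqrt(n) over random Gaussian or Rademacher matrices."""
    total = 0.0
    for _ in range(trials):
        if rademacher:
            W = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
        else:
            W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
        total += operator_norm(W) / math.sqrt(n)
    return total / trials

rng = random.Random(0)
n, d = 40, 20
print(mean_scaled_norm(n, d, 10, rng), 1.0 + math.sqrt(d / n))
print(mean_scaled_norm(n, d, 10, rng, rademacher=True), 1.0 + math.sqrt(d / n))
```

In this small experiment both ensembles should give averages of the same order as $1 + \sqrt{d/n}$, in line with the constant $c_1 = 1$ that the text attributes to the Gaussian case.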


Let us now turn to some examples of Gaussian complexity involving function spaces. Recall
the definition (5.30) of the set $\mathcal{F}(x_1^n)$ as well as the empirical $L^2$-norm (5.31). As a
consequence of the inequalities

\[ \|f - g\|_n \le \max_{i = 1, \ldots, n} |f(x_i) - g(x_i)| \le \|f - g\|_\infty, \]

we have the following relations among metric entropies:

\[ \log N_2(\delta; \mathcal{F}(x_1^n)/\sqrt{n}) \le \log N_\infty(\delta; \mathcal{F}(x_1^n)) \le \log N(\delta; \mathcal{F}, \|\cdot\|_\infty), \tag{5.42} \]

which will be useful in our development.


Example 5.20 (Empirical Gaussian complexity for a parametric function class) Let us
bound the Gaussian complexity of the set $\mathscr{P}(x_1^n)/n$ generated by the simple parametric function
class $\mathscr{P}$ from Example 5.9. Using the bound (5.42), it suffices to control the $\ell_\infty$-covering
number of $\mathscr{P}$. From our previous calculations, it can be seen that, as long as $\delta \le 1/4$, we
have $\log N_\infty(\delta; \mathscr{P}) \le \log(1/\delta)$. Moreover, since the function class is uniformly bounded
(i.e., $\|f\|_\infty \le 1$ for all $f \in \mathscr{P}$), the diameter in empirical $L^2$-norm is also well-controlled; in
particular, we have $D^2 = \sup_{f \in \mathscr{P}} \frac{1}{n} \sum_{i=1}^n f^2(x_i) \le 1$. Consequently, the discretization bound
(5.33) implies that

\[ \mathcal{G}(\mathscr{P}(x_1^n)/n) \le \frac{1}{\sqrt{n}} \inf_{\delta \in (0, 1/4]} \Big\{ \delta \sqrt{n} + 3 \sqrt{\log(1/\delta)} \Big\}. \]

In order to optimize the scaling of the bound, we set $\delta = 1/(4\sqrt{n})$, and thereby obtain the
upper bound

\[ \mathcal{G}(\mathscr{P}(x_1^n)/n) \lesssim \sqrt{\frac{\log n}{n}}. \tag{5.43} \]

As we will see later, the Gaussian complexity for this function class is actually upper
bounded by $1/\sqrt{n}$, so that the crude bound from Proposition 5.17 captures the correct behavior
only up to a logarithmic factor. We will later develop more refined techniques that
remove this logarithmic factor. ♣


Example 5.21 (Gaussian complexity for smoothness classes) Now recall the class $\mathcal{F}_L$ of
Lipschitz functions from Example 5.10. From the bounds on metric entropy given there, as
long as $\delta \in (0, \delta_0)$ for a sufficiently small $\delta_0 > 0$, we have $\log N_\infty(\delta; \mathcal{F}_L) \le \frac{cL}{\delta}$ for some
constant $c$. Since the functions in $\mathcal{F}_L$ are uniformly bounded by one, the discretization bound
implies that

\[ \mathcal{G}(\mathcal{F}_L(x_1^n)/n) \le \frac{1}{\sqrt{n}} \inf_{\delta \in (0, \delta_0)} \Big\{ \delta \sqrt{n} + 3 \sqrt{\frac{cL}{\delta}} \Big\}. \]

To obtain the tightest possible upper bound (up to constant factors), we set $\delta = n^{-1/3}$, and
hence find that

\[ \mathcal{G}(\mathcal{F}_L(x_1^n)/n) \lesssim n^{-1/3}. \tag{5.44} \]

By comparison to the parametric scaling (5.43), this upper bound decays much more
slowly. ♣
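
The choice $\delta = n^{-1/3}$ can be recovered numerically by minimizing the right-hand side of the discretization bound over a grid. The sketch below does this for the Lipschitz class, under the illustrative assumptions that $cL = 1$ and that the restriction $\delta < \delta_0$ can be ignored; the function names and the grid are likewise hypothetical. For this particular objective, setting the derivative to zero gives the minimizer $\delta^* = (3/2)^{2/3} (cL)^{1/3} n^{-1/3}$, so the rescaled quantities printed below should be roughly constant in $n$.

```python
import math

def discretization_objective(delta, n, c_times_L=1.0):
    """Right-hand side (1/sqrt(n)) * (delta*sqrt(n) + 3*sqrt(cL/delta)) for the Lipschitz class."""
    return (delta * math.sqrt(n) + 3.0 * math.sqrt(c_times_L / delta)) / math.sqrt(n)

def best_delta(n, c_times_L=1.0, grid_points=4000):
    """Grid search for the minimizing delta over a log-spaced grid in [1e-6, 1]."""
    grid = [10.0 ** (-6.0 * k / grid_points) for k in range(grid_points + 1)]
    return min(grid, key=lambda delta: discretization_objective(delta, n, c_times_L))

for n in (10**3, 10**4, 10**5, 10**6):
    d_star = best_delta(n)
    cube_root = n ** (1.0 / 3.0)
    # Both rescaled quantities should be approximately independent of n.
    print(n, d_star * cube_root, discretization_objective(d_star, n) * cube_root)
```

Replacing $\sqrt{cL/\delta}$ by $\sqrt{\log(1/\delta)}$ in the objective gives the analogous experiment for the parametric class of Example 5.20.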


5.3.3 Chaining and Dudley’s entropy integral 


In this section, we introduce an important method known as chaining, and show how it can be 
used to obtain tighter bounds on the expected suprema of sub-Gaussian processes. Recall the 
discretization bound from Proposition 5.17: it was based on a simple one-step discretization 
in which we replaced the supremum over a large set with a finite maximum over a 6-cover 
plus an approximation error. We then bounded the finite maximum by combining the union 
bound with a sub-Gaussian tail bound. In this section, we describe a substantial refinement 
of this procedure, in which we decompose the supremum into a sum of finite maxima over 
sets that are successively refined. The resulting procedure is known as the chaining method. 

In this section, we show how chaining can be used to derive a classical upper bound,
originally due to Dudley (1967), on the expected supremum of a sub-Gaussian process. In
Section 5.6, we show how related arguments can be used to control the probability of a
deviation above this expectation. Let $\{X_\theta, \theta \in T\}$ be a zero-mean sub-Gaussian process with
respect to the (pseudo)metric $\rho_X$ (see Definition 5.16). Define $D = \sup_{\theta, \tilde\theta \in T} \rho_X(\theta, \tilde\theta)$, and the
$\delta$-truncated Dudley's entropy integral

\[ \mathcal{J}(\delta; D) := \int_\delta^D \sqrt{\log N_X(u; T)} \, du, \tag{5.45} \]

where we recall that $N_X(u; T)$ denotes the $u$-covering number of $T$ with respect to $\rho_X$.


Theorem 5.22 (Dudley's entropy integral bound) Let $\{X_\theta, \theta \in T\}$ be a zero-mean sub-Gaussian
process with respect to the induced pseudometric $\rho_X$ from Definition 5.16.
Then for any $\delta \in [0, D]$, we have

\[ \mathbb{E}\Big[ \sup_{\theta, \tilde\theta \in T} (X_\theta - X_{\tilde\theta}) \Big] \le 2 \, \mathbb{E}\Big[ \sup_{\substack{\gamma, \gamma' \in T \\ \rho_X(\gamma, \gamma') \le \delta}} (X_\gamma - X_{\gamma'}) \Big] + 32 \, \mathcal{J}(\delta/4; D). \tag{5.46} \]


Remarks: There is no particular significance to the constant 32, which could be improved
with a more careful analysis. We have stated the bound in terms of the increment $X_\theta - X_{\tilde\theta}$, but
it can easily be translated into an upper bound on $\mathbb{E}[\sup_{\theta \in T} X_\theta]$. (See the discussion following
Proposition 5.17.) The usual form of Dudley's bound corresponds to the case $\delta = 0$, and so
is in terms of the entropy integral $\mathcal{J}(0; D)$. The additional flexibility to choose $\delta \in [0, D]$
can be useful in certain problems.
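
To get a feel for the entropy integral, it can be evaluated numerically once a bound on the metric entropy is available. The sketch below applies a midpoint rule to the integrand obtained from the bound $\log N_2(u; \mathbb{B}_2^d) \le d \log(1 + 2/u)$ of Example 5.8, with $D = 2$ taken as the diameter of the unit ball. The function names, step count and the use of an entropy upper bound in place of the exact entropy are assumptions of this illustration, so the result is an upper bound on $\mathcal{J}(\delta; D)$ up to quadrature error.

```python
import math

def dudley_integral(log_covering, delta, D, steps=20000):
    """Midpoint-rule approximation of the integral of sqrt(log N(u)) over [delta, D]."""
    h = (D - delta) / steps
    total = 0.0
    for k in range(steps):
        u = delta + (k + 0.5) * h
        total += math.sqrt(max(log_covering(u), 0.0))
    return total * h

def ball_entropy_bound(d):
    """Upper bound d * log(1 + 2/u) on the metric entropy of the Euclidean unit ball."""
    return lambda u: d * math.log(1.0 + 2.0 / u)

for d in (1, 4, 16, 64):
    value = dudley_integral(ball_entropy_bound(d), 0.0, 2.0)
    # The integral is finite even for delta = 0 and grows like sqrt(d).
    print(d, value, value / math.sqrt(d))
```

The printed ratio is constant in $d$, so that, up to the constant in (5.46), Dudley's bound with $\delta = 0$ reproduces the $\sqrt{d}$ scaling of $\mathcal{G}(\mathbb{B}_2^d)$.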


Proof We begin with the inequality (5.34) previously established in the proof of Proposition 5.17,
namely,

\[ \sup_{\theta, \tilde\theta \in T} (X_\theta - X_{\tilde\theta}) \le 2 \sup_{\substack{\gamma, \gamma' \in T \\ \rho_X(\gamma, \gamma') \le \delta}} (X_\gamma - X_{\gamma'}) + 2 \max_{i = 1, 2, \ldots, N} |X_{\theta^i} - X_{\theta^1}|. \]

In the proof of Proposition 5.17, we simply upper bounded the maximum over $i = 1, \ldots, N$
using the union bound. In this proof, we pursue a more refined chaining argument. Define
$U = \{\theta^1, \ldots, \theta^N\}$, and for each integer $m = 1, 2, \ldots, L$, let $U_m$ be a minimal $\epsilon_m = D 2^{-m}$ covering
set of $U$ in the metric $\rho_X$, where we allow for any element of $T$ to be used in forming
the cover. Since $U$ is a subset of $T$, each set has cardinality $N_m := |U_m|$ upper bounded as
$N_m \le N_X(\epsilon_m; T)$. Since $U$ is finite, there is some finite integer $L$ for which $U_L = U$. (In
particular, for the smallest integer such that $N_L = |U|$, we can simply choose $U_L = U$.) For
each $m = 1, \ldots, L$, define the mapping $\pi_m : U \to U_m$ via

\[ \pi_m(\theta) = \arg\min_{\beta \in U_m} \rho_X(\theta, \beta), \]

so that $\pi_m(\theta)$ is the best approximation of $\theta \in U$ from the set $U_m$. Using this notation, we
can decompose the random variable $X_\theta$ into a sum of increments in terms of an associated
sequence $(\gamma^1, \ldots, \gamma^L)$, where we define $\gamma^L = \theta$ and $\gamma^{m-1} := \pi_{m-1}(\gamma^m)$ recursively for
$m = L, L-1, \ldots, 2$. By construction, we then have the chaining relation

\[ X_\theta - X_{\gamma^1} = \sum_{m=2}^L (X_{\gamma^m} - X_{\gamma^{m-1}}), \tag{5.47} \]

and hence $|X_\theta - X_{\gamma^1}| \le \sum_{m=2}^L \max_{\beta \in U_m} |X_\beta - X_{\pi_{m-1}(\beta)}|$. See Figure 5.3 for an illustration of this
set-up.

Figure 5.3 Illustration of the chaining relation in the case $L = 5$. The set $U$, shown
at the bottom of the figure, has a finite number of elements. For each $m = 1, \ldots, 5$,
we let $U_m$ be a $D 2^{-m}$-cover of the set $U$; the elements of the cover are shaded in gray
at each level. For each element $\theta \in U$, we construct the chain by setting $\gamma^5 = \theta$,
and then recursively $\gamma^{m-1} = \pi_{m-1}(\gamma^m)$ for $m = 5, \ldots, 2$. We can then decompose the
difference $X_\theta - X_{\gamma^1}$ as a sum (5.47) of terms along the edges of a tree.

Thus, we have decomposed the difference between $X_\theta$ and the final element $X_{\gamma^1}$ in its
associated chain as a sum of increments. Given any other $\tilde\theta \in U$, we can define the chain
$\{\tilde\gamma^1, \ldots, \tilde\gamma^L\}$, and then derive an analogous bound for the increment $|X_{\tilde\theta} - X_{\tilde\gamma^1}|$. By appropriately
adding and subtracting terms and then applying the triangle inequality, we obtain

\[ |X_\theta - X_{\tilde\theta}| = |X_{\gamma^1} - X_{\tilde\gamma^1} + (X_\theta - X_{\gamma^1}) + (X_{\tilde\gamma^1} - X_{\tilde\theta})| \le |X_{\gamma^1} - X_{\tilde\gamma^1}| + |X_\theta - X_{\gamma^1}| + |X_{\tilde\theta} - X_{\tilde\gamma^1}|. \]

Taking maxima over $\theta, \tilde\theta \in U$ on the left-hand side and using our upper bounds on the
right-hand side, we obtain

\[ \max_{\theta, \tilde\theta \in U} |X_\theta - X_{\tilde\theta}| \le \max_{\gamma, \tilde\gamma \in U_1} |X_\gamma - X_{\tilde\gamma}| + 2 \sum_{m=2}^L \max_{\beta \in U_m} |X_\beta - X_{\pi_{m-1}(\beta)}|. \]
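We first upper bound the finite maximum over $U_1$, which has $N(D/2) := N_X(D/2; T)$ elements.
By the sub-Gaussian nature of the process, the increment $X_\gamma - X_{\tilde\gamma}$ is sub-Gaussian with
parameter at most $\rho_X(\gamma, \tilde\gamma) \le D$. Consequently, by our earlier results on finite Gaussian
maxima (see Exercise 2.12), we have

\[
\mathbb{E}\Big[ \max_{\gamma, \tilde\gamma \in U_1} |X_\gamma - X_{\tilde\gamma}| \Big] \le 2 D \sqrt{\log N(D/2)}.
\]

Similarly, for each $m = 2, 3, \ldots, L$, the set $U_m$ has $N(D 2^{-m})$ elements, and, moreover,
$\max_{\beta \in U_m} \rho_X(\beta, \pi_{m-1}(\beta)) \le D 2^{-(m-1)}$, whence

\[
\mathbb{E}\Big[ \max_{\beta \in U_m} |X_\beta - X_{\pi_{m-1}(\beta)}| \Big] \le 2\, D 2^{-(m-1)} \sqrt{\log N(D 2^{-m})}.
\]

Combining the pieces, we conclude that

\[
\mathbb{E}\Big[ \max_{\theta, \tilde\theta \in U} |X_\theta - X_{\tilde\theta}| \Big]
\le 4 \sum_{m=1}^{L} D 2^{-(m-1)} \sqrt{\log N(D 2^{-m})}.
\]

Since the metric entropy $\log N(t)$ is non-increasing in $t$, we have

\[
D 2^{-(m-1)} \sqrt{\log N(D 2^{-m})} \le 4 \int_{D 2^{-(m+1)}}^{D 2^{-m}} \sqrt{\log N(u)} \, du,
\]

and hence $2\, \mathbb{E}\big[\max_{\theta, \tilde\theta \in U} |X_\theta - X_{\tilde\theta}|\big] \le 32 \int_{\delta/4}^{D} \sqrt{\log N(u)} \, du$.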
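
The comparison between each chaining term and the entropy integral over the matching dyadic shell can be checked numerically. The following is a minimal sketch, assuming a hypothetical metric entropy $\log N(u) = d \log(1 + 2/u)$ (chosen only because it is non-increasing in $u$); the function names and the values of $D$, $L$ and $d$ are illustrative and not taken from the text.

```python
import math

def log_covering(u: float, d: int = 3) -> float:
    """Hypothetical metric entropy log N(u) = d * log(1 + 2/u); non-increasing in u."""
    return d * math.log(1.0 + 2.0 / u)

def integral_sqrt_entropy(a: float, b: float, steps: int = 2000) -> float:
    """Midpoint-rule approximation of the integral of sqrt(log N(u)) over [a, b]."""
    h = (b - a) / steps
    return h * sum(math.sqrt(log_covering(a + (k + 0.5) * h)) for k in range(steps))

def chaining_terms(D: float, L: int):
    """For m = 1..L, pair the m-th chaining term with 4x the integral over its shell."""
    pairs = []
    for m in range(1, L + 1):
        # Term D 2^{-(m-1)} sqrt(log N(D 2^{-m})) from the chaining sum.
        term = D * 2.0 ** (-(m - 1)) * math.sqrt(log_covering(D * 2.0 ** (-m)))
        # Four times the integral over the shell [D 2^{-(m+1)}, D 2^{-m}].
        shell = 4.0 * integral_sqrt_entropy(D * 2.0 ** (-(m + 1)), D * 2.0 ** (-m))
        pairs.append((term, shell))
    return pairs

pairs = chaining_terms(D=2.0, L=10)
chain_sum = sum(term for term, _ in pairs)
integral_bound = sum(shell for _, shell in pairs)
```

Each shell has length $D 2^{-(m+1)}$ and the integrand is at least $\sqrt{\log N(D 2^{-m})}$ on it, which is exactly why the factor 4 appears.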


Let us illustrate the Dudley entropy bound with some examples. 


Example 5.23 In Example 5.20, we showed that the Gaussian complexity of the parametric
function class $\mathcal{P}$ was upper bounded by $O\big(\sqrt{\tfrac{\log n}{n}}\big)$, a result obtained by the naive
discretization bound. Here we show that the Dudley entropy integral yields the sharper upper
bound $O(1/\sqrt{n})$. In particular, since the $L^\infty$-norm metric entropy is upper bounded as
$\log N(\delta; \mathcal{P}, \|\cdot\|_\infty) = O(\log(1 + 1/\delta))$, the Dudley bound implies that

\[
\mathcal{G}\Big( \frac{\mathcal{P}(x_1^n)}{n} \Big) \le \frac{c}{\sqrt{n}} \int_0^2 \sqrt{\log(1 + 1/u)} \, du = \frac{c'}{\sqrt{n}}.
\]

Thus, we have removed the logarithmic factor from the naive discretization bound. ♣
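
The argument in this example relies on the entropy integral being finite despite the singularity of the integrand at zero. The sketch below approximates the integral of $\sqrt{\log(1 + 1/u)}$ over $(0, 2]$ by the midpoint rule; the upper limit 2 and the step count are assumptions made for illustration.

```python
import math

def entropy_integral(upper: float = 2.0, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of sqrt(log(1 + 1/u)) over (0, upper].

    The integrand blows up only like sqrt(log(1/u)) as u -> 0, which is integrable,
    so the midpoint rule (which never evaluates at u = 0) converges.
    """
    h = upper / steps
    return h * sum(
        math.sqrt(math.log(1.0 + 1.0 / ((k + 0.5) * h))) for k in range(steps)
    )

value = entropy_integral()
```

The result is a modest constant that does not depend on $n$, so the only dependence on the sample size comes from the $1/\sqrt{n}$ pre-factor.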


Recall from Chapter 4 our discussion of the Vapnik—Chervonenkis dimension. As we now 
show, the Dudley integral can be used to obtain a sharp result for any finite VC class. 


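Example 5.24 (Bounds for Vapnik–Chervonenkis classes) Let $\mathcal{F}$ be a $b$-uniformly bounded
class of functions with finite VC dimension $\nu$, and suppose that we are interested in establishing
a uniform law of large numbers for $\mathcal{F}$, that is, in controlling the random variable
$\sup_{f \in \mathcal{F}} \big| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}[f] \big|$, where $X_i \sim \mathbb{P}$ are i.i.d. samples. As discussed in Chapter 4, by
exploiting concentration and symmetrization results, the study of this random variable can be
reduced to controlling the expectation $\mathbb{E}_\varepsilon\big[ \sup_{f \in \mathcal{F}} \big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \big| \big]$, where $\varepsilon_i$ are i.i.d. Rademacher
variables (random signs), and the observations $x_i$ are fixed for the moment.

In order to see how Dudley's entropy integral may be applied, define the zero-mean random
variable $Z_f := \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i)$, and consider the stochastic process $\{Z_f \mid f \in \mathcal{F}\}$. It is
straightforward to verify that the increment $Z_f - Z_g$ is sub-Gaussian with parameter

\[
\|f - g\|_{\mathbb{P}_n}^2 := \frac{1}{n} \sum_{i=1}^n \big( f(x_i) - g(x_i) \big)^2.
\]

Consequently, by Dudley's entropy integral, we have

\[
\mathbb{E}_\varepsilon\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big| \Big]
\le \frac{24}{\sqrt{n}} \int_0^{2b} \sqrt{\log N(t; \mathcal{F}, \|\cdot\|_{\mathbb{P}_n})} \, dt, \tag{5.48}
\]

where we have used the fact that $\sup_{f, g \in \mathcal{F}} \|f - g\|_{\mathbb{P}_n} \le 2b$. Now by known results on VC
classes and metric entropy, there is a universal constant $C$ such that

\[
N(\epsilon; \mathcal{F}, \|\cdot\|_{\mathbb{P}_n}) \le C \nu (16 e)^{\nu} \Big( \frac{b}{\epsilon} \Big)^{2\nu}. \tag{5.49}
\]

See Exercise 5.4 for the proof of a weaker claim of this form, and the bibliographic section
for further discussion of such bounds.

Substituting the metric entropy bound (5.49) into the entropy integral (5.48), we find that
there are universal constants $c_0$ and $c_1$, depending on $b$ but not on $(\nu, n)$, such that

\[
\mathbb{E}_\varepsilon\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big| \Big]
\le c_0 \sqrt{\frac{\nu}{n}} \Big[ 1 + \int_0^{2b} \sqrt{\log(b/t)} \, dt \Big]
= c_1 \sqrt{\frac{\nu}{n}}, \tag{5.50}
\]

since the integral is finite. ♣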


Note that the bound (5.50) is sharper than the earlier $\sqrt{\nu \log(n+1)/n}$ bound that we proved
in Lemma 4.14. It leads to various improvements of previous results that we have stated.
For example, consider the classical Glivenko–Cantelli setting, which amounts to bounding
$\|\widehat{F}_n - F\|_\infty = \sup_{u \in \mathbb{R}} |\widehat{F}_n(u) - F(u)|$. Since the set of indicator functions has VC dimension
$\nu = 1$, the bound (5.50), combined with Theorem 4.10, yields that

\[
\mathbb{P}\Big[ \|\widehat{F}_n - F\|_\infty \ge \frac{c}{\sqrt{n}} + \delta \Big] \le 2 e^{-\frac{n \delta^2}{8}} \quad \text{for all } \delta \ge 0, \tag{5.51}
\]

where $c$ is a universal constant. Apart from better constants, this bound is unimprovable.


5.4 Some Gaussian comparison inequalities 


Suppose that we are given a pair of Gaussian processes, say $\{Y_\theta, \theta \in T\}$ and $\{Z_\theta, \theta \in T\}$, both
indexed by the same set $T$. It is often useful to compare the two processes in some sense,
possibly in terms of the expected value of some real-valued function $F$ defined on the processes.
One important example is the supremum $F(X) := \sup_{\theta \in T} X_\theta$. Under what conditions
can we say that $F(X)$ is larger (or smaller) than $F(Y)$? Results that allow us to deduce such
properties are known as Gaussian comparison inequalities, and there are a large number of
them. In this section, we derive a few of the standard ones, and illustrate them via a number
of examples.


Recall that we have defined the suprema of Gaussian processes by taking limits of maxima 
over finite subsets. For this reason, it is sufficient to consider the case where T is finite, say 
T = {1,..., N} for some integer N. We focus on this case throughout this section, adopting 
the notation [N] = {1,...,N} as a convenient shorthand. 


5.4.1 A general comparison result 


We begin by stating and proving a fairly general Gaussian comparison principle: 
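Theorem 5.25 Let $(X_1, \ldots, X_N)$ and $(Y_1, \ldots, Y_N)$ be a pair of centered Gaussian random
vectors, and suppose that there exist disjoint subsets $A$ and $B$ of $[N] \times [N]$ such
that

\[
\begin{aligned}
\mathbb{E}[X_i X_j] &\le \mathbb{E}[Y_i Y_j] && \text{for all } (i, j) \in A, && (5.52a) \\
\mathbb{E}[X_i X_j] &\ge \mathbb{E}[Y_i Y_j] && \text{for all } (i, j) \in B, && (5.52b) \\
\mathbb{E}[X_i X_j] &= \mathbb{E}[Y_i Y_j] && \text{for all } (i, j) \notin A \cup B. && (5.52c)
\end{aligned}
\]

Let $F : \mathbb{R}^N \to \mathbb{R}$ be a twice-differentiable function, and suppose that

\[
\begin{aligned}
\frac{\partial^2 F}{\partial u_i \partial u_j}(u) &\ge 0 && \text{for all } (i, j) \in A, && (5.53a) \\
\frac{\partial^2 F}{\partial u_i \partial u_j}(u) &\le 0 && \text{for all } (i, j) \in B. && (5.53b)
\end{aligned}
\]

Then we are guaranteed that

\[
\mathbb{E}[F(X)] \le \mathbb{E}[F(Y)]. \tag{5.54}
\]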


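Proof We may assume without loss of generality that $X$ and $Y$ are independent. We proceed
via a classical interpolation argument: define the Gaussian random vector

\[
Z(t) = \sqrt{1 - t}\, X + \sqrt{t}\, Y, \quad \text{for each } t \in [0, 1], \tag{5.55}
\]

and consider the function $\phi : [0, 1] \to \mathbb{R}$ given by $\phi(t) = \mathbb{E}[F(Z(t))]$. If we can show that
$\phi'(t) \ge 0$ for all $t \in (0, 1)$, then we may conclude that

\[
\mathbb{E}[F(Y)] = \phi(1) \ge \phi(0) = \mathbb{E}[F(X)].
\]

With this goal in mind, for a given $t \in (0, 1)$, we begin by using the chain rule to compute
the first derivative

\[
\phi'(t) = \sum_{j=1}^{N} \mathbb{E}\Big[ \frac{\partial F}{\partial z_j}(Z(t))\, Z_j'(t) \Big],
\]

where $Z_j'(t) := \frac{d}{dt} Z_j(t) = -\frac{1}{2\sqrt{1 - t}} X_j + \frac{1}{2\sqrt{t}} Y_j$. Computing the expectation, we find that

\[
\mathbb{E}[Z_i(t)\, Z_j'(t)]
= \mathbb{E}\Big[ \big( \sqrt{1 - t}\, X_i + \sqrt{t}\, Y_i \big) \Big( -\frac{1}{2\sqrt{1 - t}} X_j + \frac{1}{2\sqrt{t}} Y_j \Big) \Big]
= \frac{1}{2} \big\{ \mathbb{E}[Y_i Y_j] - \mathbb{E}[X_i X_j] \big\}.
\]

Consequently, for each $i = 1, \ldots, N$, we can write $Z_i(t) = \alpha_{ij} Z_j'(t) + W_{ij}$, where $\alpha_{ij} \ge 0$
for $(i, j) \in A$, $\alpha_{ij} \le 0$ for $(i, j) \in B$, and $\alpha_{ij} = 0$ if $(i, j) \notin A \cup B$. (The variable $W_{ij}$ does
depend on $t$, but we omit this dependence so as to simplify notation.) Moreover, due to the
Gaussian assumption, we are guaranteed that the random vector $W(j) := (W_{1j}, \ldots, W_{Nj})$ is
independent of $Z_j'(t)$.

Since $F$ is twice differentiable, we may apply a first-order Taylor series to the function
$\partial F / \partial z_j$ between the points $W(j)$ and $Z(t)$, thereby obtaining

\[
\frac{\partial F}{\partial z_j}(Z(t)) = \frac{\partial F}{\partial z_j}(W(j)) + \sum_{i=1}^{N} \frac{\partial^2 F}{\partial z_j \partial z_i}(U)\, \alpha_{ij}\, Z_j'(t),
\]

where $U \in \mathbb{R}^N$ is some intermediate point between $W(j)$ and $Z(t)$. Taking expectations then
yields

\[
\begin{aligned}
\mathbb{E}\Big[ \frac{\partial F}{\partial z_j}(Z(t))\, Z_j'(t) \Big]
&= \mathbb{E}\Big[ \frac{\partial F}{\partial z_j}(W(j))\, Z_j'(t) \Big]
+ \sum_{i=1}^{N} \mathbb{E}\Big[ \frac{\partial^2 F}{\partial z_j \partial z_i}(U)\, \alpha_{ij}\, (Z_j'(t))^2 \Big] \\
&= \sum_{i=1}^{N} \mathbb{E}\Big[ \frac{\partial^2 F}{\partial z_j \partial z_i}(U)\, \alpha_{ij}\, (Z_j'(t))^2 \Big],
\end{aligned}
\]

where the first term vanishes since $W(j)$ and $Z_j'(t)$ are independent, and $Z_j'(t)$ is zero-mean.
By our assumptions on the second derivatives of $F$ and the previously stated conditions on
$\alpha_{ij}$, we have $\frac{\partial^2 F}{\partial z_j \partial z_i}(U)\, \alpha_{ij} \ge 0$, so that we may conclude that $\phi'(t) \ge 0$ for all $t \in (0, 1)$, which
completes the proof.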


5.4.2 Slepian and Sudakov—Fernique inequalities 


An important corollary of Theorem 5.25 is Slepian’s inequality. 


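Corollary 5.26 (Slepian's inequality) Let $X \in \mathbb{R}^N$ and $Y \in \mathbb{R}^N$ be zero-mean Gaussian
random vectors such that

\[
\begin{aligned}
\mathbb{E}[X_i X_j] &\ge \mathbb{E}[Y_i Y_j] && \text{for all } i \ne j, && (5.56a) \\
\mathbb{E}[X_i^2] &= \mathbb{E}[Y_i^2] && \text{for all } i = 1, 2, \ldots, N. && (5.56b)
\end{aligned}
\]

Then we are guaranteed

\[
\mathbb{E}\Big[ \max_{i = 1, \ldots, N} X_i \Big] \le \mathbb{E}\Big[ \max_{i = 1, \ldots, N} Y_i \Big]. \tag{5.57}
\]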


Proof In order to study the maximum, let us introduce, for each $\beta > 0$, a real-valued
function on $\mathbb{R}^N$ via $F_\beta(x) := \beta^{-1} \log \big\{ \sum_{j=1}^N \exp(\beta x_j) \big\}$. By a straightforward calculation, we
find the useful relation

\[
\max_{j = 1, \ldots, N} x_j \le F_\beta(x) \le \max_{j = 1, \ldots, N} x_j + \frac{\log N}{\beta}, \quad \text{valid for all } \beta > 0, \tag{5.58}
\]

so that bounds on $F_\beta$ can be used to control the maximum by taking $\beta \to +\infty$. Note that
$F_\beta$ is twice differentiable for each $\beta > 0$; moreover, some calculus shows that $\frac{\partial^2 F_\beta}{\partial x_i \partial x_j} \le 0$
for all $i \ne j$. Consequently, applying Theorem 5.25 with $A = \emptyset$ and $B = \{(i, j), i \ne j\}$ yields
that $\mathbb{E}[F_\beta(X)] \le \mathbb{E}[F_\beta(Y)]$. Combining this inequality with the sandwich relation (5.58), we
conclude that

\[
\mathbb{E}\Big[ \max_{j = 1, \ldots, N} X_j \Big] \le \mathbb{E}\Big[ \max_{j = 1, \ldots, N} Y_j \Big] + \frac{\log N}{\beta},
\]

and taking the limit $\beta \to +\infty$ yields the claim.
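
Both ingredients of this proof, the sandwich relation (5.58) and the sign of the mixed partial derivatives, are easy to check numerically. The sketch below assumes the closed form $\partial^2 F_\beta / \partial x_i \partial x_j = -\beta p_i p_j$ for $i \ne j$, where $p$ denotes the softmax weights of $\beta x$ (a standard calculation, not spelled out in the text), and compares it against a finite-difference estimate; the test vector and function names are illustrative.

```python
import math

def soft_max(x, beta):
    """F_beta(x) = (1/beta) * log(sum_j exp(beta * x_j)), shifted by max(x) for stability."""
    m = max(x)
    return m + math.log(sum(math.exp(beta * (xj - m)) for xj in x)) / beta

def mixed_partial(x, beta, i, j):
    """Closed form -beta * p_i * p_j for i != j, with p the softmax weights of beta * x."""
    m = max(x)
    weights = [math.exp(beta * (xk - m)) for xk in x]
    total = sum(weights)
    return -beta * (weights[i] / total) * (weights[j] / total)

def mixed_partial_fd(x, beta, i, j, h=1e-3):
    """Central finite-difference estimate of the same mixed partial derivative."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return soft_max(y, beta)
    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)

x = [0.3, -1.2, 2.5, 2.4, 0.0]

# Sandwich relation (5.58): max_j x_j <= F_beta(x) <= max_j x_j + log(N) / beta.
sandwich_ok = all(
    max(x) <= soft_max(x, beta) <= max(x) + math.log(len(x)) / beta
    for beta in (0.5, 1.0, 10.0, 100.0)
)

# Off-diagonal second derivative at beta = 1 for the coordinate pair (2, 3).
closed_form = mixed_partial(x, 1.0, 2, 3)
finite_diff = mixed_partial_fd(x, 1.0, 2, 3)
```

Since the softmax weights are strictly positive, the off-diagonal entries are strictly negative, which is the condition (5.53b) needed on the set $B$.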


Note that Theorem 5.25 and Corollary 5.26 are stated in terms of the variances and correlations
of the random vector. In many cases, it is more convenient to compare two Gaussian
processes in terms of their associated pseudometrics

\[
\rho_X^2(i, j) = \mathbb{E}(X_i - X_j)^2 \quad \text{and} \quad \rho_Y^2(i, j) = \mathbb{E}(Y_i - Y_j)^2.
\]

The Sudakov–Fernique comparison is stated in exactly this way.


Theorem 5.27 (Sudakov–Fernique) Given a pair of zero-mean $N$-dimensional Gaussian
vectors $(X_1, \ldots, X_N)$ and $(Y_1, \ldots, Y_N)$, suppose that

\[
\mathbb{E}[(X_i - X_j)^2] \le \mathbb{E}[(Y_i - Y_j)^2] \quad \text{for all } (i, j) \in [N] \times [N]. \tag{5.59}
\]

Then $\mathbb{E}[\max_{j = 1, \ldots, N} X_j] \le \mathbb{E}[\max_{j = 1, \ldots, N} Y_j]$.


Remark: It is worth noting that the Sudakov—Fernique theorem also yields Slepian’s in- 
equality as a corollary. In particular, if the Slepian conditions (5.56a) hold, then it can be 
seen that the Sudakov—Fernique condition (5.59) also holds. The proof of Theorem 5.27 is 
more involved than that of Slepian’s inequality; see the bibliographic section for references 
to some proofs. 


5.4.3 Gaussian contraction inequality 


One important consequence of the Sudakov–Fernique comparison is the Gaussian contraction
inequality, which applies to functions $\phi_j : \mathbb{R} \to \mathbb{R}$ that are 1-Lipschitz, meaning that
$|\phi_j(s) - \phi_j(t)| \le |s - t|$ for all $s, t \in \mathbb{R}$, and satisfy the centering relation $\phi_j(0) = 0$. Given a
vector $\theta \in \mathbb{R}^d$, we define (with a minor abuse of notation) the vector

\[
\phi(\theta) := \big( \phi_1(\theta_1), \phi_2(\theta_2), \ldots, \phi_d(\theta_d) \big) \in \mathbb{R}^d.
\]

Lastly, given a set $T \subset \mathbb{R}^d$, we let $\phi(T) = \{\phi(\theta) \mid \theta \in T\}$ denote its image under the mapping
$\phi$. The following result shows that the Gaussian complexity of this image is never larger than
the Gaussian complexity $\mathcal{G}(T)$ of the original set.
Proposition 5.28 (Gaussian contraction inequality) For any set $T \subseteq \mathbb{R}^d$ and any family
of centered 1-Lipschitz functions $\{\phi_j, j = 1, \ldots, d\}$, we have

\[
\underbrace{\mathbb{E}\Big[ \sup_{\theta \in T} \sum_{j=1}^{d} w_j \phi_j(\theta_j) \Big]}_{\mathcal{G}(\phi(T))}
\le \underbrace{\mathbb{E}\Big[ \sup_{\theta \in T} \sum_{j=1}^{d} w_j \theta_j \Big]}_{\mathcal{G}(T)}. \tag{5.60}
\]

We leave the proof of this claim for the reader (see Exercise 5.12). For future reference, we
also note that, with an additional factor of 2, an analogous result holds for the Rademacher
complexity, namely

\[
\mathcal{R}(\phi(T)) \le 2 \mathcal{R}(T) \tag{5.61}
\]

for any family of centered 1-Lipschitz functions. The proof of this result is somewhat more
delicate than the Gaussian case; see the bibliographic section for further discussion.


Let us illustrate the use of the Gaussian contraction inequality (5.60) with some examples. 


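Example 5.29 Given a function class $\mathcal{F}$ and a collection of design points $x_1^n$, we have previously
studied the Gaussian complexity of the set $\mathcal{F}(x_1^n) \subset \mathbb{R}^n$ defined in equation (5.30).
In various statistical problems, it is often more natural to consider the Gaussian complexity
of the set

\[
\mathcal{F}^2(x_1^n) := \big\{ (f^2(x_1), f^2(x_2), \ldots, f^2(x_n)) \mid f \in \mathcal{F} \big\} \subset \mathbb{R}^n,
\]

where $f^2(x) = [f(x)]^2$ are the squared function values. The contraction inequality allows us
to upper bound the Gaussian complexity of this set in terms of the original set $\mathcal{F}(x_1^n)$. In
particular, suppose that the function class is $b$-uniformly bounded, so that $\|f\|_\infty \le b$ for all
$f \in \mathcal{F}$. We then claim that

\[
\mathcal{G}(\mathcal{F}^2(x_1^n)) \le 2b\, \mathcal{G}(\mathcal{F}(x_1^n)), \tag{5.62}
\]

so that the Gaussian complexity of $\mathcal{F}^2(x_1^n)$ is not essentially larger than that of $\mathcal{F}(x_1^n)$.
In order to establish this bound, define the function $\phi_b : \mathbb{R} \to \mathbb{R}$ via

\[
\phi_b(t) := \begin{cases} t^2 / (2b) & \text{if } |t| \le b, \\ b/2 & \text{otherwise.} \end{cases}
\]

Since $|f(x_i)| \le b$, we have $\phi_b(f(x_i)) = \frac{f^2(x_i)}{2b}$ for all $f \in \mathcal{F}$ and $i = 1, 2, \ldots, n$, and hence

\[
\frac{1}{2b}\, \mathcal{G}(\mathcal{F}^2(x_1^n))
= \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} w_i \frac{f^2(x_i)}{2b} \Big]
= \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} w_i \phi_b(f(x_i)) \Big].
\]

Moreover, it is straightforward to verify that $\phi_b$ is a contraction according to our definition,
and hence applying Proposition 5.28 yields

\[
\mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} w_i \phi_b(f(x_i)) \Big]
\le \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} w_i f(x_i) \Big] = \mathcal{G}(\mathcal{F}(x_1^n)).
\]

Putting together the pieces yields the claim (5.62). ♣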
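
The claim that the truncated quadratic used in this example is a centered contraction can be checked on a grid. The sketch below is illustrative only: the value of $b$ and the grid are arbitrary choices, and a grid check is of course not a proof (the proof is that the derivative $t/b$ has magnitude at most one on $[-b, b]$, and the function is constant outside).

```python
def phi_b(t: float, b: float) -> float:
    """Truncated quadratic from Example 5.29: t^2/(2b) on [-b, b], and b/2 outside."""
    return t * t / (2.0 * b) if abs(t) <= b else b / 2.0

b = 1.5
grid = [-3.0 + 0.05 * k for k in range(121)]  # covers [-3, 3], i.e. [-2b, 2b]

# Largest difference quotient over all pairs of distinct grid points.
max_slope = max(
    abs(phi_b(s, b) - phi_b(t, b)) / abs(s - t)
    for s in grid
    for t in grid
    if s != t
)

# On [-b, b] the map agrees with v -> v^2 / (2b), the identity used in the example.
agrees = all(abs(2.0 * b * phi_b(v, b) - v * v) < 1e-12 for v in grid if abs(v) <= b)
```

The truncation at $|t| = b$ is what keeps the slope bounded; the untruncated map $t \mapsto t^2/(2b)$ is not Lipschitz on all of $\mathbb{R}$.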


5.5 Sudakov’s lower bound 


In previous sections, we have derived two upper bounds on the expected supremum of a sub- 
Gaussian process over a given set: the simple one-step discretization in Proposition 5.17, 
and the more refined Dudley integral bound in Theorem 5.22. In this section, we turn to the 
complementary question of deriving lower bounds. In contrast to the upper bounds in the 
preceding sections, these lower bounds are specialized to the case of Gaussian processes, 
since a general sub-Gaussian process might have different behavior than its Gaussian analog. 
For instance, compare the Rademacher and Gaussian complexity of the $\ell_1$-ball, as discussed
in Example 5.14. 

This section is devoted to the exploration of a lower bound known as the Sudakov minor- 
ation, which is obtained by exploiting the Gaussian comparison inequalities discussed in the 
previous section. 


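Theorem 5.30 (Sudakov minoration) Let $\{X_\theta, \theta \in T\}$ be a zero-mean Gaussian process
defined on the non-empty set $T$. Then

\[
\mathbb{E}\Big[ \sup_{\theta \in T} X_\theta \Big] \ge \sup_{\delta > 0} \frac{\delta}{2} \sqrt{\log M_X(\delta; T)}, \tag{5.63}
\]

where $M_X(\delta; T)$ is the $\delta$-packing number of $T$ in the metric $\rho_X(\theta, \tilde\theta) := \sqrt{\mathbb{E}[(X_\theta - X_{\tilde\theta})^2]}$.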


Proof For any $\delta > 0$, let $\{\theta^1, \ldots, \theta^M\}$ be a $\delta$-packing of $T$, and consider the sequence $\{Y_i\}_{i=1}^M$
with elements $Y_i := X_{\theta^i}$. Note that by construction, we have the lower bound

\[
\mathbb{E}[(Y_i - Y_j)^2] = \rho_X^2(\theta^i, \theta^j) > \delta^2 \quad \text{for all } i \ne j.
\]

Now let us define an i.i.d. sequence of Gaussian random variables $Z_i \sim N(0, \delta^2/2)$ for
$i = 1, \ldots, M$. Since $\mathbb{E}[(Z_i - Z_j)^2] = \delta^2$ for all $i \ne j$, the pair of random vectors $Y$ and $Z$
satisfy the Sudakov–Fernique condition (5.59), so that we are guaranteed that

\[
\mathbb{E}\Big[ \sup_{\theta \in T} X_\theta \Big] \ge \mathbb{E}\Big[ \max_{i = 1, \ldots, M} Y_i \Big] \ge \mathbb{E}\Big[ \max_{i = 1, \ldots, M} Z_i \Big].
\]

Since the variables $\{Z_i\}_{i=1}^M$ are zero-mean Gaussian and i.i.d., we can apply standard results
on i.i.d. Gaussian maxima (viz. Exercise 2.11) to obtain the lower bound
$\mathbb{E}[\max_{i = 1, \ldots, M} Z_i] \ge \frac{\delta}{2} \sqrt{\log M}$, thereby completing the proof.
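
The final step of the proof, a lower bound on the expected maximum of $M$ i.i.d. $N(0, \delta^2/2)$ variables, can be illustrated by simulation. The following is a minimal Monte Carlo sketch; the values of $M$ and $\delta$, the number of trials and the seed are arbitrary choices made for illustration.

```python
import math
import random

def mean_max_gaussian(M: int, sigma: float, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of M i.i.d. N(0, sigma^2) variables]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(M))
    return total / trials

delta = 1.0
M = 64
# The comparison variables Z_i in the proof have variance delta^2 / 2.
estimate = mean_max_gaussian(M, delta / math.sqrt(2.0))
# The quantity appearing on the right-hand side of the Sudakov bound (5.63).
sudakov_term = 0.5 * delta * math.sqrt(math.log(M))
```

In this configuration the simulated mean sits comfortably above the Sudakov term, consistent with the constant $1/2$ in (5.63) not being sharp.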


Let us illustrate the Sudakov lower bound with some examples. 


Example 5.31 (Gaussian complexity of $\ell_2$-ball) We have shown previously that the Gaussian
complexity $\mathcal{G}(\mathbb{B}_2^d)$ of the $d$-dimensional Euclidean ball is upper bounded as $\mathcal{G}(\mathbb{B}_2^d) \le \sqrt{d}$.
We have verified this fact both by direct calculation and through use of the upper bound in
Proposition 5.17. Here let us show how the Sudakov minoration captures the complementary
lower bound. From Example 5.9, the metric entropy of the ball $\mathbb{B}_2^d$ in $\ell_2$-norm is lower
bounded as $\log N_2(\delta; \mathbb{B}^d) \ge d \log(1/\delta)$. Thus, by Lemma 5.5, the same lower bound applies
to $\log M_2(\delta; \mathbb{B}^d)$. Therefore, the Sudakov bound (5.63) implies that

\[
\mathcal{G}(\mathbb{B}_2^d) \ge \sup_{\delta > 0} \Big\{ \frac{\delta}{2} \sqrt{d \log(1/\delta)} \Big\} \ge \frac{\sqrt{\log 4}}{8} \sqrt{d},
\]

where we set $\delta = 1/4$ in order to obtain the second inequality. Thus, in this simple case,
the Sudakov lower bound recovers the correct scaling as a function of $\sqrt{d}$, albeit with
sub-optimal control of the constant. ♣
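
As a side calculation (not part of the original example), one can ask how much is lost by the convenient choice $\delta = 1/4$. Taking the entropy lower bound $d \log(1/\delta)$ at face value for all $\delta \in (0, 1)$, the supremum can be computed in closed form:

```latex
% Maximize g(\delta) = (\delta/2) \sqrt{d \log(1/\delta)} over \delta \in (0, 1).
% It is equivalent to maximize g(\delta)^2 = (d/4)\, \delta^2 \log(1/\delta).
\frac{d}{d\delta} \Big[ \delta^2 \log(1/\delta) \Big]
  = 2\delta \log(1/\delta) - \delta
  = \delta \big( 2 \log(1/\delta) - 1 \big) = 0
  \quad \Longrightarrow \quad \delta^\ast = e^{-1/2},
\qquad
g(\delta^\ast) = \frac{e^{-1/2}}{2} \sqrt{\frac{d}{2}}
  = \frac{\sqrt{d}}{2\sqrt{2e}} \approx 0.214 \sqrt{d},
\qquad
g(1/4) = \frac{\sqrt{\log 4}}{8} \sqrt{d} \approx 0.147 \sqrt{d}.
```

Optimizing over $\delta$ therefore improves the constant only modestly, and it remains well below the constant 1 in the upper bound $\sqrt{d}$.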


We can also use the Sudakov minoration to upper bound the metric entropy of a set T, 
assuming that we have an upper bound on its Gaussian complexity, as illustrated in the 
following example. 


Example 5.32 (Metric entropy of $\ell_1$-ball) Let us use the Sudakov minoration to upper
bound the metric entropy of the $\ell_1$-ball $\mathbb{B}_1^d = \{\theta \in \mathbb{R}^d \mid \sum_{j=1}^d |\theta_j| \le 1\}$. We first observe that its
Gaussian complexity can be upper bounded as

\[
\mathcal{G}(\mathbb{B}_1^d) = \mathbb{E}\Big[ \sup_{\|\theta\|_1 \le 1} \langle w, \theta \rangle \Big] = \mathbb{E}[\|w\|_\infty] \le 2 \sqrt{\log d},
\]

where we have used the duality between the $\ell_1$- and $\ell_\infty$-norms, and standard results on
Gaussian maxima (see Exercise 2.11). Applying Sudakov's minoration, we conclude that
the metric entropy of the $d$-dimensional ball $\mathbb{B}_1^d$ in the $\ell_2$-norm is upper bounded as

\[
\log N(\delta; \mathbb{B}_1^d, \|\cdot\|_2) \le c\, (1/\delta)^2 \log d. \tag{5.64}
\]

It is known that (for the most relevant range of $\delta$) this upper bound on the metric entropy
of $\mathbb{B}_1^d$ is tight up to constant factors; see the bibliographic section for further discussion. We
thus see in a different way how the $\ell_1$-ball is much smaller than the $\ell_2$-ball, since its metric
entropy scales logarithmically in dimension, as opposed to linearly. ♣
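
The step from Sudakov's minoration to the entropy bound (5.64) is short, and it may help to see it written out. The sketch below uses the fact that, for the canonical Gaussian process $X_\theta = \langle w, \theta \rangle$, the induced metric is the Euclidean norm, together with the standard relation $N(\delta) \le M(\delta)$ between covering and packing numbers (Exercise 5.2); the explicit value $c = 16$ is simply what these constants give and is not claimed to be optimal.

```latex
% Sudakov (5.63) at a fixed \delta > 0, combined with the Gaussian complexity bound:
\frac{\delta}{2} \sqrt{\log M(\delta; \mathbb{B}_1^d, \|\cdot\|_2)}
  \;\le\; \mathcal{G}(\mathbb{B}_1^d)
  \;\le\; 2 \sqrt{\log d}.
% Squaring and rearranging, then using N(\delta) \le M(\delta):
\log N(\delta; \mathbb{B}_1^d, \|\cdot\|_2)
  \;\le\; \log M(\delta; \mathbb{B}_1^d, \|\cdot\|_2)
  \;\le\; \frac{16}{\delta^2} \log d.
```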


As another example, let us now return to some analysis of the singular values of Gaussian 
random matrices. 


Example 5.33 (Lower bounds on maximum singular value) As a continuation of Example 5.19,
let us use the Sudakov minoration to lower bound the maximum singular value of
a standard Gaussian random matrix $W \in \mathbb{R}^{n \times d}$. Recall that we can write

\[
\mathbb{E}[|||W|||_2] = \mathbb{E}\Big[ \sup_{\Theta \in \mathbb{M}^{n,d}(1)} \langle\langle W, \Theta \rangle\rangle \Big],
\]

where the set $\mathbb{M}^{n,d}(1)$ was previously defined (5.37). Consequently, in order to lower bound
$\mathbb{E}[|||W|||_2]$ via Sudakov minoration, it suffices to lower bound the metric entropy of $\mathbb{M}^{n,d}(1)$ in
the Frobenius norm. In Exercise 5.13, we show that there is a universal constant $c_1$ such that

\[
\log M(\delta; \mathbb{M}^{n,d}(1); |||\cdot|||_F) \ge c_1^2 (n + d) \log(1/\delta) \quad \text{for all } \delta \in (0, \tfrac{1}{2}).
\]

Setting $\delta = 1/4$, the Sudakov minoration implies that

\[
\mathbb{E}[|||W|||_2] \ge \frac{c_1}{8} \sqrt{n + d}\, \sqrt{\log 4} \ge c_1' \big( \sqrt{n} + \sqrt{d} \big).
\]

Comparing with the upper bound from Example 5.19, we see that this lower bound has the
correct scaling in the pair $(n, d)$. ♣


5.6 Chaining and Orlicz processes 


In Section 5.3.3, we introduced the idea of chaining, and showed how it can be used to ob- 
tain upper bounds on the expected supremum of a sub-Gaussian process. When the process 
is actually Gaussian, then classical concentration results can be used to show that the supre- 
mum is sharply concentrated around this expectation (see Exercise 5.10). For more general 
sub-Gaussian processes, it is useful to be able to derive similar bounds on the probability of 
deviations above the tail. Moreover, there are many processes that do not have sub-Gaussian 
tails, but rather instead are sub-exponential in nature. It is also useful to obtain bounds on 
the expected supremum and associated deviation bounds for such processes. 

The notion of an Orlicz norm allows us to treat both sub-Gaussian and sub-exponential
processes in a unified manner. For a given parameter $q \in [1, 2]$, consider the function
$\psi_q(t) := \exp(t^q) - 1$. This function can be used to define a norm on the space of random
variables as follows:


Definition 5.34 (Orlicz norm) The $\psi_q$-Orlicz norm of a zero-mean random variable
$X$ is given by

\[
\|X\|_{\psi_q} := \inf\{\lambda > 0 \mid \mathbb{E}[\psi_q(|X|/\lambda)] \le 1\}. \tag{5.65}
\]

The Orlicz norm is infinite if there is no $\lambda \in \mathbb{R}$ for which the given expectation is finite.


Any random variable with a bounded Orlicz norm satisfies a concentration inequality
specified in terms of the function $\psi_q$. In particular, we have

\[
\mathbb{P}[|X| \ge t] \overset{(i)}{=} \mathbb{P}\big[ \psi_q(|X| / \|X\|_{\psi_q}) \ge \psi_q(t / \|X\|_{\psi_q}) \big] \overset{(ii)}{\le} \frac{1}{\psi_q(t / \|X\|_{\psi_q})},
\]

where the equality (i) follows because $\psi_q$ is an increasing function, and the bound (ii) follows
from Markov's inequality. In the case $q = 2$, this bound is essentially equivalent to our
usual sub-Gaussian tail bound; see Exercise 2.18 for further details.
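
To make the definition concrete, the Orlicz norm of a finitely supported random variable can be computed by bisection, since the expectation in (5.65) is decreasing in $\lambda$. The sketch below is an illustration only: the helper names are hypothetical, and the Rademacher variable is chosen because its $\psi_2$-norm has the closed form $1/\sqrt{\log 2}$ (solve $\exp(1/\lambda^2) - 1 = 1$).

```python
import math

def psi(t: float, q: float) -> float:
    """psi_q(t) = exp(t^q) - 1, returning +inf instead of overflowing."""
    x = t ** q
    return math.inf if x > 700.0 else math.expm1(x)

def orlicz_norm(values, probs, q: float = 2.0, tol: float = 1e-12) -> float:
    """Bisection for inf{lam > 0 : E[psi_q(|X| / lam)] <= 1}, X finitely supported."""
    def excess(lam: float) -> float:
        # E[psi_q(|X| / lam)] - 1, skipping atoms of probability zero.
        return sum(p * psi(abs(v) / lam, q) for v, p in zip(values, probs) if p > 0) - 1.0

    hi = max(abs(v) for v in values)
    if hi == 0.0:
        return 0.0
    while excess(hi) > 0.0:  # grow hi until the constraint is satisfied
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return hi

# Rademacher variable: values -1 and +1 with probability 1/2 each.
rademacher_norm = orlicz_norm([-1.0, 1.0], [0.5, 0.5], q=2.0)
# Tail bound 1 / psi_q(t / ||X||) evaluated at t = 1.
tail_bound_at_1 = 1.0 / psi(1.0 / rademacher_norm, 2.0)
```

For the Rademacher variable the tail bound at $t = 1$ evaluates to one, matching $\mathbb{P}[|X| \ge 1] = 1$, so the Markov step (ii) can hold with equality.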


Based on the notion of the Orlicz norm, we can now define an interesting generalization 
of a sub-Gaussian process: 
Definition 5.35 A zero-mean stochastic process $\{X_\theta, \theta \in T\}$ is a $\psi_q$-process with
respect to a metric $\rho$ if

\[
\|X_\theta - X_{\tilde\theta}\|_{\psi_q} \le \rho(\theta, \tilde\theta) \quad \text{for all } \theta, \tilde\theta \in T. \tag{5.66}
\]

As a particular example, in this new terminology, it can be verified that the canonical Gaussian
process is a $\psi_2$-process with respect to the (scaled) Euclidean metric $\rho(\theta, \tilde\theta) = 2 \|\theta - \tilde\theta\|_2$.


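We define the generalized Dudley entropy integral

\[
\mathcal{J}_q(\delta; D) := \int_\delta^D \psi_q^{-1}\big( N(u; T, \rho) \big) \, du, \tag{5.67}
\]

where $\psi_q^{-1}$ is the inverse function of $\psi_q$, and $D = \sup_{\theta, \tilde\theta \in T} \rho(\theta, \tilde\theta)$ is the diameter of the set $T$
under $\rho$. For the exponential-type functions considered here, note that we have

\[
\psi_q^{-1}(u) = [\log(1 + u)]^{1/q}. \tag{5.68}
\]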


With this set-up, we have the following result: 


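Theorem 5.36 Let $\{X_\theta, \theta \in T\}$ be a $\psi_q$-process with respect to $\rho$. Then there is a
universal constant $c_1$ such that

\[
\mathbb{P}\Big[ \sup_{\theta, \tilde\theta \in T} |X_\theta - X_{\tilde\theta}| \ge c_1 \big( \mathcal{J}_q(0; D) + t \big) \Big] \le 2 e^{-t^q / D^q} \quad \text{for all } t > 0. \tag{5.69}
\]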


A few comments on this result are in order. Note that the bound (5.69) involves the generalized
Dudley entropy integral (5.67) for $\delta = 0$. As with our earlier statement of Dudley's
entropy integral bound, there is a generalization of Theorem 5.36 that involves the truncated
form, along with some discretization error. Otherwise, Theorem 5.36 should be understood
as generalizing Theorem 5.22 in two ways. First, it applies to general Orlicz processes for
$q \in [1, 2]$, with the sub-Gaussian setting corresponding to the special case $q = 2$. Second, it
provides a tail bound on the random variable, as opposed to a bound only on its expectation.
(Note that a bound on the expectation can be recovered by integrating the tail bound, in the
usual way.)


Proof We begin by stating an auxiliary lemma that is of independent interest. For any
measurable set $A$ and random variable $Y$, let us introduce the shorthand notation
$\mathbb{E}_A[Y] = \int_A Y \, d\mathbb{P}$. Note that we have $\mathbb{E}_A[Y] = \mathbb{E}[Y \mid A]\, \mathbb{P}[A]$ by construction.

Lemma 5.37 Suppose that $Y_1, \ldots, Y_N$ are non-negative random variables such that
$\|Y_i\|_{\psi_q} \le 1$. Then for any measurable set $A$, we have

\[
\mathbb{E}_A[Y_i] \le \mathbb{P}[A]\, \psi_q^{-1}\big( 1 / \mathbb{P}(A) \big) \quad \text{for all } i = 1, 2, \ldots, N, \tag{5.70}
\]

and moreover

\[
\mathbb{E}_A\Big[ \max_{i = 1, \ldots, N} Y_i \Big] \le \mathbb{P}[A]\, \psi_q^{-1}\Big( \frac{N}{\mathbb{P}(A)} \Big). \tag{5.71}
\]


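Proof Let us first establish the inequality (5.70). By definition, we have

\[
\begin{aligned}
\mathbb{E}_A[Y] &= \mathbb{P}[A]\, \frac{1}{\mathbb{P}[A]}\, \mathbb{E}_A\big[ \psi_q^{-1}(\psi_q(Y)) \big] \\
&\overset{(i)}{\le} \mathbb{P}[A]\, \psi_q^{-1}\Big( \frac{\mathbb{E}_A[\psi_q(Y)]}{\mathbb{P}[A]} \Big) \\
&\overset{(ii)}{\le} \mathbb{P}[A]\, \psi_q^{-1}\Big( \frac{1}{\mathbb{P}[A]} \Big),
\end{aligned}
\]

where step (i) uses concavity of $\psi_q^{-1}$ and Jensen's inequality (noting that the ratio $\mathbb{E}_A[\cdot] / \mathbb{P}[A]$ defines
a conditional distribution); whereas step (ii) uses the fact that $\mathbb{E}_A[\psi_q(Y)] \le \mathbb{E}[\psi_q(Y)] \le 1$,
which follows since $\psi_q(Y)$ is non-negative, and the Orlicz norm of $Y$ is at most one, combined
with the fact that $\psi_q^{-1}$ is an increasing function.

We now prove its extension (5.71). Any measurable set $A$ can be partitioned into a disjoint
union of sets $A_i$, $i = 1, 2, \ldots, N$, such that $Y_i = \max_{j = 1, \ldots, N} Y_j$ on $A_i$. Using this partition, we
have

\[
\mathbb{E}_A\Big[ \max_{i = 1, \ldots, N} Y_i \Big] = \sum_{i=1}^{N} \mathbb{E}_{A_i}[Y_i]
\le \mathbb{P}[A] \sum_{i=1}^{N} \frac{\mathbb{P}[A_i]}{\mathbb{P}[A]}\, \psi_q^{-1}\Big( \frac{1}{\mathbb{P}[A_i]} \Big)
\le \mathbb{P}[A]\, \psi_q^{-1}\Big( \frac{N}{\mathbb{P}(A)} \Big),
\]

where the last step uses the concavity of $\psi_q^{-1}$, and Jensen's inequality with the weights
$\mathbb{P}[A_i] / \mathbb{P}[A]$.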


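In order to appreciate the relevance of this lemma for Theorem 5.36, let us use it to show
that the supremum $Z := \sup_{\theta, \tilde\theta \in T} |X_\theta - X_{\tilde\theta}|$ satisfies the inequality

\[
\mathbb{E}_A[Z] \le 8\, \mathbb{P}[A] \int_0^D \psi_q^{-1}\Big( \frac{N(u; T, \rho)}{\mathbb{P}[A]} \Big) \, du. \tag{5.72}
\]

Choosing $A$ to be the full probability space immediately yields an upper bound on the expected
supremum, namely $\mathbb{E}[Z] \le 8 \mathcal{J}_q(D)$. On the other hand, if we choose $A = \{Z \ge t\}$,
then we have

\[
\mathbb{P}[Z \ge t] \overset{(i)}{\le} \frac{1}{t}\, \mathbb{E}_A[Z] \overset{(ii)}{\le} 8\, \frac{\mathbb{P}[Z \ge t]}{t} \int_0^D \psi_q^{-1}\Big( \frac{N(u; T, \rho)}{\mathbb{P}[Z \ge t]} \Big) \, du,
\]

where step (i) follows from a version of Markov's inequality, and step (ii) follows from the
bound (5.72). Canceling out a factor of $\mathbb{P}[Z \ge t]$ from both sides, and using the inequality
$\psi_q^{-1}(st) \le c\, (\psi_q^{-1}(s) + \psi_q^{-1}(t))$, we obtain

\[
t \le 8c \Big\{ \int_0^D \psi_q^{-1}\big( N(u; T, \rho) \big) \, du + D\, \psi_q^{-1}\Big( \frac{1}{\mathbb{P}[Z \ge t]} \Big) \Big\}.
\]

Let $\delta > 0$ be arbitrary, and set $t = 8c\, (\mathcal{J}_q(D) + \delta)$. Some algebra then yields the inequality
$\delta \le D\, \psi_q^{-1}\big( \frac{1}{\mathbb{P}[Z \ge t]} \big)$, or equivalently

\[
\mathbb{P}\big[ Z \ge 8c\, (\mathcal{J}_q(D) + \delta) \big] \le \frac{1}{\psi_q(\delta / D)},
\]

as claimed.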
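
The inequality $\psi_q^{-1}(st) \le c\, (\psi_q^{-1}(s) + \psi_q^{-1}(t))$ used in this argument is stated without proof. As a side remark, for the exponential-type functions with inverse (5.68), a short calculation suggests that $c = 1$ already suffices for all $s, t \ge 0$ and $q \ge 1$:

```latex
% Step 1: since (1+s)(1+t) = 1 + s + t + st \ge 1 + st for s, t \ge 0,
\log(1 + st) \;\le\; \log(1 + s) + \log(1 + t).
% Step 2: for q \ge 1 the map x \mapsto x^{1/q} is concave with value 0 at 0,
% hence subadditive on [0, \infty): (a + b)^{1/q} \le a^{1/q} + b^{1/q}.
\psi_q^{-1}(st) = [\log(1 + st)]^{1/q}
  \;\le\; [\log(1 + s) + \log(1 + t)]^{1/q}
  \;\le\; [\log(1 + s)]^{1/q} + [\log(1 + t)]^{1/q}
  \;=\; \psi_q^{-1}(s) + \psi_q^{-1}(t).
```

In the application above, $s = N(u; T, \rho)$ and $t = 1/\mathbb{P}[Z \ge t]$, and the second term integrates over $[0, D]$ to give the factor $D$.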

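In order to prove Theorem 5.36, it suffices to establish the bound (5.72). We do so by
combining Lemma 5.37 with the chaining argument previously used to prove Theorem 5.22.
Let us recall the set-up from that earlier proof: by following the one-step discretization
argument, our problem was reduced to bounding the quantity $\mathbb{E}[\sup_{\theta, \tilde\theta \in U} |X_\theta - X_{\tilde\theta}|]$, where
$U = \{\theta^1, \ldots, \theta^N\}$ was a $\delta$-cover of the original set. For each $m = 1, 2, \ldots, L$, let $U_m$ be a minimal
$D 2^{-m}$-cover of $U$ in the metric $\rho_X$, so that at the $m$th step, the set $U_m$ has $N_m = N_X(\epsilon_m; U)$
elements. Similarly, define the mapping $\pi_m : U \to U_m$ via $\pi_m(\theta) = \arg\min_{\gamma \in U_m} \rho_X(\theta, \gamma)$, so
that $\pi_m(\theta)$ is the best approximation of $\theta \in U$ from the set $U_m$. Using this notation, we
derived the chaining upper bound

\[
\mathbb{E}_A\Big[ \max_{\theta, \tilde\theta \in U} |X_\theta - X_{\tilde\theta}| \Big]
\le 2 \sum_{m=1}^{L} \mathbb{E}_A\Big[ \max_{\gamma \in U_m} |X_\gamma - X_{\pi_{m-1}(\gamma)}| \Big]. \tag{5.73}
\]

(Previously, we had the usual expectation, as opposed to the object $\mathbb{E}_A$ used here.) For each
$\gamma \in U_m$, we are guaranteed that

\[
\|X_\gamma - X_{\pi_{m-1}(\gamma)}\|_{\psi_q} \le \rho_X(\gamma, \pi_{m-1}(\gamma)) \le D 2^{-(m-1)}.
\]

Since $|U_m| = N(D 2^{-m})$, Lemma 5.37 implies that

\[
\mathbb{E}_A\Big[ \max_{\gamma \in U_m} |X_\gamma - X_{\pi_{m-1}(\gamma)}| \Big]
\le \mathbb{P}[A]\, D 2^{-(m-1)}\, \psi_q^{-1}\Big( \frac{N(D 2^{-m})}{\mathbb{P}(A)} \Big)
\]

for every measurable set $A$. Consequently, from the upper bound (5.73), we obtain

\[
\mathbb{E}_A\Big[ \max_{\theta, \tilde\theta \in U} |X_\theta - X_{\tilde\theta}| \Big]
\le 2\, \mathbb{P}[A] \sum_{m=1}^{L} D 2^{-(m-1)}\, \psi_q^{-1}\Big( \frac{N(D 2^{-m})}{\mathbb{P}(A)} \Big)
\le 8\, \mathbb{P}[A] \int_0^D \psi_q^{-1}\Big( \frac{N(u)}{\mathbb{P}(A)} \Big) \, du,
\]

since the sum can be upper bounded by the integral.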


5.7 Bibliographic details and background 


The notion of metric entropy was introduced by Kolmogorov (1956; 1958) and further de- 
veloped by various authors; see the paper by Kolmogorov and Tikhomirov (1959) for an 
overview and some discussion of the early history. Metric entropy, along with related
notions of the “sizes” of various function classes, are central objects of study in the field of
approximation theory; see the books (DeVore and Lorentz, 1993; Pinkus, 1985; Carl and 
Stephani, 1990) for further details on approximation and operator theory. Examples 5.10 
and 5.11 are discussed in depth by Kolmogorov and Tikhomirov (1959), as is the metric en- 
tropy bound for the special ellipsoid given in Example 5.12. Mitjagin (1961) proves a more 
general result, giving a sharp characterization of the metric entropy for any ellipsoid; see 
also Lorentz (1966) for related results. 

The pioneering work of Dudley (1967) established the connection between the entropy 
integral and the behavior of Gaussian processes. The idea of chaining itself dates back to 
Kolmogorov and others. Upper bounds based on entropy integrals are not always the best 
possible. Sharp upper and lower bounds for expected Gaussian suprema can be derived by 
the generic chaining method of Talagrand (2000). The proof of the Orlicz-norm generaliza- 
tion of Dudley’s entropy integral in Theorem 5.36 is based on Ledoux and Talagrand (1991). 

The metric entropy of the $\ell_1$-ball was discussed in Example 5.32; more generally, sharp
upper and lower bounds on the entropy numbers of $\ell_q$-balls for $q \in (0, 1]$ were obtained by
Schütt (1984) and Kühn (2001). Raskutti et al. (2011) convert these estimates to upper and
lower bounds on the metric entropy; see Lemma 2 in their paper.

Gaussian comparison inequalities have a lengthy and rich history in probability theory and 
geometric functional analysis (e.g., Slepian, 1962; Fernique, 1974; Gordon, 1985; Kahane, 
1986; Milman and Schechtman, 1986; Gordon, 1986, 1987; Ledoux and Talagrand, 1991). 
A version of Slepian’s inequality was first established in the paper (Slepian, 1962). Ledoux 
and Talagrand (1991) provide a detailed discussion of Gaussian comparison inequalities, 
including Slepian’s inequality, the Sudakov—Fernique inequality and Gordon’s inequalities. 
The proofs of Theorems 5.25 and 5.36 follow this development. Chatterjee (2005) provides 
a self-contained proof of the Sudakov—Fernique inequality, including control on the slack 
in the bound; see also Chernozhukov et al. (2013) for related results. Among other results, 
Gordon (1987) provides generalizations of Slepian's inequality and related results to elliptically
contoured distributions. Section 4.2 of Ledoux and Talagrand (1991) contains a proof
of the contraction inequality (5.61) for the Rademacher complexity. 

The bound (5.49) on the metric entropy of a VC class is proved in Theorem 2.6.7 of van 
der Vaart and Wellner (1996). Exercise 5.4, adapted from this same book, works through the 
proof of a weaker bound. 


5.8 Exercises 


Exercise 5.1 (Failure of total boundedness) Let $\mathcal{C}([0, 1], b)$ denote the class of all convex
functions $f$ defined on the unit interval such that $\|f\|_\infty \le b$. Show that $\mathcal{C}([0, 1], b)$ is not
totally bounded in the sup-norm. (Hint: Try to construct an infinite collection of functions
$\{f^j\}_{j=1}^\infty$ such that $\|f^j - f^k\|_\infty \ge 1/2$ for all $j \ne k$.)


Exercise 5.2 (Packing and covering) Prove the following relationships between packing
and covering numbers:

\[
M(2\delta; T, \rho) \overset{(a)}{\le} N(\delta; T, \rho) \overset{(b)}{\le} M(\delta; T, \rho).
\]
Exercise 5.3 (Packing of Boolean hypercube) Recall from Example 5.3 the binary hypercube
$\mathbb{H}^d = \{0, 1\}^d$ equipped with the rescaled Hamming metric (5.1b). Prove that the packing
number satisfies the bound

\[
\frac{\log M(\delta; \mathbb{H}^d)}{d} \le D(\delta/2 \,\|\, 1/2) + \frac{\log(d + 1)}{d},
\]

where $D(\delta/2 \,\|\, 1/2) = \frac{\delta}{2} \log \frac{\delta/2}{1/2} + \big(1 - \frac{\delta}{2}\big) \log \frac{1 - \delta/2}{1/2}$ is the Kullback–Leibler divergence between
the Bernoulli distributions with parameter $\delta/2$ and $1/2$. (Hint: You may find the result of
Exercise 2.10 to be useful.)


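Exercise 5.4 (From VC dimension to metric entropy) In this exercise, we explore the
connection between VC dimension and metric entropy. Given a set class $\mathcal{S}$ with finite VC
dimension $\nu$, we show that the function class $\mathcal{F}_{\mathcal{S}} := \{\mathbb{1}_S, S \in \mathcal{S}\}$ of indicator functions has
metric entropy at most

\[
N(\delta; \mathcal{F}_{\mathcal{S}}, L^1(\mathbb{P})) \le K(\nu) \Big( \frac{3}{\delta} \Big)^{2\nu} \quad \text{for a constant } K(\nu). \tag{5.74}
\]

Let $\{\mathbb{1}_{S_1}, \ldots, \mathbb{1}_{S_N}\}$ be a maximal $\delta$-packing in the $L^1(\mathbb{P})$-norm, so that

\[
\|\mathbb{1}_{S_i} - \mathbb{1}_{S_j}\|_1 = \mathbb{E}\big[ |\mathbb{1}_{S_i}(X) - \mathbb{1}_{S_j}(X)| \big] > \delta \quad \text{for all } i \ne j.
\]

By Exercise 5.2, this $N$ is an upper bound on the $\delta$-covering number.


(a) Suppose that we generate $n$ samples $X_i$, $i = 1, \ldots, n$, drawn i.i.d. from $\mathbb{P}$. Show that
the probability that every set $S_i$ picks out a different subset of $\{X_1, \ldots, X_n\}$ is at least
$1 - \binom{N}{2} (1 - \delta)^n$.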
(b) Using part (a), show that for N > 2 andn = , there exists a set of n points from 
which S picks out at least N subsets, and sels that N < (Heny $ 
(c) Use part (b) to show that the bound (5.74) holds with K(v) := y, 


= Aken 


Exercise 5.5 (Gaussian and Rademacher complexity) In this problem, we explore the connection
between the Gaussian and Rademacher complexity of a set.


(a) Show that for any set $T \subseteq \mathbb{R}^d$, the Rademacher complexity satisfies the upper bound
$\mathcal{R}(T) \le \sqrt{\frac{\pi}{2}}\, \mathcal{G}(T)$. Give an example of a set for which this bound is met with equality.

(b) Show that $\mathcal{G}(T) \le 2 \sqrt{\log d}\, \mathcal{R}(T)$ for any set $T \subseteq \mathbb{R}^d$. Give an example for which this
upper bound is tight up to the constant pre-factor. (Hint: In proving this bound, you may
assume the Rademacher analog of the contraction inequality, namely that $\mathcal{R}(\phi(T)) \le \mathcal{R}(T)$
for any contraction.)


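Exercise 5.6 (Gaussian complexity for $\ell_q$-balls) The $\ell_q$-ball of unit radius is given by

\[
\mathbb{B}_q^d(1) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_q \le 1\},
\]

where $\|\theta\|_q = \big( \sum_{j=1}^d |\theta_j|^q \big)^{1/q}$ for $q \in [1, \infty)$ and $\|\theta\|_\infty = \max_j |\theta_j|$.

(a) For $q \in (1, \infty)$, show that there are constants $c_q$ such that

\[
\sqrt{\frac{2}{\pi}} \le \frac{\mathcal{G}(\mathbb{B}_q^d(1))}{d^{1 - \frac{1}{q}}} \le c_q.
\]

(b) Compute the Gaussian complexity $\mathcal{G}(\mathbb{B}_\infty^d(1))$ exactly.
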
Exercise 5.7 (Upper bounds for $\ell_0$-“balls”) Consider the set

\[
T^d(s) := \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s, \ \|\theta\|_2 \le 1\},
\]

corresponding to all $s$-sparse vectors contained within the Euclidean unit ball. In this exercise,
we prove that its Gaussian complexity is upper bounded as

\[
\mathcal{G}(T^d(s)) \lesssim \sqrt{s \log\Big( \frac{e d}{s} \Big)}. \tag{5.75}
\]

(a) First show that $\mathcal{G}(T^d(s)) = \mathbb{E}\big[ \max_{|S| = s} \|w_S\|_2 \big]$, where $w_S \in \mathbb{R}^{|S|}$ denotes the subvector of
$(w_1, \ldots, w_d)$ indexed by the subset $S \subset \{1, 2, \ldots, d\}$.
(b) Next show that

\[
\mathbb{P}\big[ \|w_S\|_2 \ge \sqrt{s} + \delta \big] \le e^{-\delta^2/2}
\]

for any fixed subset $S$ of cardinality $s$.
(c) Use the preceding parts to establish the bound (5.75).


Exercise 5.8 (Lower bounds for $\ell_0$-“balls”) In Exercise 5.7, we established an upper bound
on the Gaussian complexity of the set

$$T^d(s) := \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s, \ \|\theta\|_2 \le 1\}.$$

The goal of this exercise is to establish the matching lower bound.

(a) Derive a lower bound on the $1/\sqrt{2}$ covering number of $T^d(s)$ in the Euclidean norm.
(Hint: The Gilbert–Varshamov lemma could be useful to you.)
(b) Use part (a) and a Gaussian comparison result to show that

$$\mathcal{G}(T^d(s)) \gtrsim \sqrt{s \log\Big(\frac{ed}{s}\Big)}.$$


Exercise 5.9 (Gaussian complexity of ellipsoids) Recall that the space $\ell^2(\mathbb{N})$ consists
of all real sequences $(\theta_j)_{j=1}^\infty$ such that $\sum_{j=1}^\infty \theta_j^2 < \infty$. Given a strictly positive sequence
$(\mu_j)_{j=1}^\infty \in \ell^2(\mathbb{N})$, consider the associated ellipse

$$\mathcal{E} := \Big\{ (\theta_j)_{j=1}^\infty \ \Big|\ \sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \le 1 \Big\}.$$

Ellipses of this form will play an important role in our subsequent analysis of the statistical
properties of reproducing kernel Hilbert spaces.

(a) Prove that the Gaussian complexity satisfies the bounds

$$\sqrt{\frac{2}{\pi}} \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2} \le \mathcal{G}(\mathcal{E}) \le \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2}.$$

(Hint: Parts of previous problems may be helpful to you.)

(b) For a given radius $r > 0$ consider the truncated set

$$\widetilde{\mathcal{E}}(r) := \mathcal{E} \cap \Big\{ (\theta_j)_{j=1}^\infty \ \Big|\ \sum_{j=1}^\infty \theta_j^2 \le r^2 \Big\}.$$

Obtain upper and lower bounds on the Gaussian complexity $\mathcal{G}(\widetilde{\mathcal{E}}(r))$ that are tight up to
universal constants, independent of $r$ and $(\mu_j)_{j=1}^\infty$. (Hint: Try to reduce the problem to an
instance of (a).)


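Exercise 5.10 (Concentration of Gaussian suprema) Let $\{X_\theta, \theta \in T\}$ be a zero-mean
Gaussian process, and define $Z = \sup_{\theta \in T} X_\theta$. Prove that

$$\mathbb{P}\big[|Z - \mathbb{E}[Z]| \ge \delta\big] \le 2 e^{-\frac{\delta^2}{2\sigma^2}},$$

where $\sigma^2 := \sup_{\theta \in T} \mathrm{var}(X_\theta)$ is the maximal variance of the process.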


Exercise 5.11 (Details of Example 5.19) In this exercise, we work through the details of 
Example 5.19. 


(a) Show that the maximum singular value $|||W|||_2$ has the variational representation (5.38).

(b) Defining the random variable $X_\Theta = \langle\langle W, \Theta \rangle\rangle$, show that the stochastic process
$\{X_\Theta, \Theta \in \mathbb{M}^{n,d}(1)\}$ is zero-mean, and sub-Gaussian with respect to the Frobenius norm
$|||\Theta - \Theta'|||_F$.

(c) Prove the upper bound (5.40). 

(d) Prove the upper bound (5.41) on the metric entropy. 


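Exercise 5.12 (Gaussian contraction inequality) For each $j = 1, \ldots, d$, let $\phi_j : \mathbb{R} \to \mathbb{R}$ be
a centered 1-Lipschitz function, meaning that $\phi_j(0) = 0$, and $|\phi_j(s) - \phi_j(t)| \le |s - t|$ for all
$s, t \in \mathbb{R}$. Given a set $T \subseteq \mathbb{R}^d$, consider the set

$$\phi(T) := \{(\phi_1(\theta_1), \phi_2(\theta_2), \ldots, \phi_d(\theta_d)) \mid \theta \in T\} \subseteq \mathbb{R}^d.$$

Prove the Gaussian contraction inequality $\mathcal{G}(\phi(T)) \le \mathcal{G}(T)$.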


Exercise 5.13 (Details of Example 5.33) Recall the set $\mathbb{M}^{n,d}(1)$ from Example 5.33. Show
that

$$\log M(\delta; \mathbb{M}^{n,d}(1), |||\cdot|||_F) \gtrsim (n + d) \log(1/\delta) \quad \text{for all } \delta \in (0, 1/2).$$


Exercise 5.14 (Maximum singular value of Gaussian random matrices) In this exercise, we
explore one method for obtaining tail bounds on the maximal singular value of a Gaussian
random matrix $W \in \mathbb{R}^{n \times d}$ with i.i.d. $N(0, 1)$ entries.

(a) To build intuition, let us begin by doing a simple simulation. Write a short computer
program to generate Gaussian random matrices $W \in \mathbb{R}^{n \times d}$ for $n = 1000$ and $d = \lceil \alpha n \rceil$,
and to compute the maximum singular value of $W/\sqrt{n}$, denoted by $\sigma_{\max}(W)/\sqrt{n}$.
Perform $T = 20$ trials for each value of $\alpha$ in the set $\{0.1 + k(0.025), k = 1, \ldots, 100\}$. Plot
the resulting curve of $\alpha$ versus the average of $\sigma_{\max}(W)/\sqrt{n}$.
(b) Now let's do some analysis to understand this behavior. Prove that

$$\sigma_{\max}(W) = \sup_{u \in S^{n-1}} \sup_{v \in S^{d-1}} u^T W v,$$

where $S^{d-1} = \{y \in \mathbb{R}^d \mid \|y\|_2 = 1\}$ is the Euclidean unit sphere in $\mathbb{R}^d$.
(c) Observe that $Z_{u,v} := u^T W v$ defines a Gaussian process indexed by the Cartesian product
$T := S^{n-1} \times S^{d-1}$. Prove the upper bound

$$\mathbb{E}[\sigma_{\max}(W)] = \mathbb{E}\Big[\sup_{(u,v) \in T} u^T W v\Big] \le \sqrt{n} + \sqrt{d}.$$

(Hint: For $(u, v) \in S^{n-1} \times S^{d-1}$, consider the zero-mean Gaussian variable
$Y_{u,v} = \langle g, u \rangle + \langle h, v \rangle$, where $g \sim N(0, I_{n \times n})$ and $h \sim N(0, I_{d \times d})$ are independent Gaussian random
vectors. We thus obtain a second Gaussian process $\{Y_{u,v}, (u, v) \in S^{n-1} \times S^{d-1}\}$, and you
may find it useful to compare $\{Z_{u,v}\}$ and $\{Y_{u,v}\}$.)
(d) Prove that

$$\mathbb{P}\Big[\frac{\sigma_{\max}(W)}{\sqrt{n}} \ge 1 + \sqrt{\frac{d}{n}} + t\Big] \le 2 e^{-\frac{n t^2}{2}}.$$
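The simulation requested in part (a) of Exercise 5.14 can be sketched as follows. This is a minimal sketch rather than a reference solution: it assumes NumPy is available, the function names `average_scaled_sigma_max` and `simulate_curve` are illustrative choices, and the default problem size is smaller than the $n = 1000$, $T = 20$ of the exercise so that it runs quickly. Plotting is left to whichever library the reader prefers.

```python
import numpy as np


def average_scaled_sigma_max(n, alpha, trials, rng):
    """Average of sigma_max(W)/sqrt(n) over `trials` draws of an n x ceil(alpha*n)
    matrix W with i.i.d. N(0, 1) entries."""
    d = int(np.ceil(alpha * n))
    values = []
    for _ in range(trials):
        W = rng.standard_normal((n, d))
        # For a 2-D array, ord=2 gives the largest singular value.
        values.append(np.linalg.norm(W, ord=2) / np.sqrt(n))
    return float(np.mean(values))


def simulate_curve(n=200, trials=5, alphas=None, seed=0):
    """Return the grid of aspect ratios alpha and the averaged sigma_max(W)/sqrt(n)."""
    rng = np.random.default_rng(seed)
    if alphas is None:
        alphas = 0.1 + 0.025 * np.arange(1, 101)  # the grid from the exercise
    alphas = np.asarray(alphas, dtype=float)
    averages = np.array(
        [average_scaled_sigma_max(n, a, trials, rng) for a in alphas]
    )
    return alphas, averages


if __name__ == "__main__":
    alphas, averages = simulate_curve(n=100, trials=3, alphas=[0.25, 0.5, 1.0, 2.0])
    for a, s in zip(alphas, averages):
        # Part (c) suggests comparing against 1 + sqrt(d/n), roughly 1 + sqrt(alpha).
        print(f"alpha={a:.2f}  average={s:.3f}  1+sqrt(alpha)={1 + np.sqrt(a):.3f}")
```

Calling `simulate_curve(n=1000, trials=20)` reproduces the setting of the exercise, at the cost of a much longer run time.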


6 Random matrices and covariance estimation


Covariance matrices play a central role in statistics, and there exist a variety of methods for 
estimating them based on data. The problem of covariance estimation dovetails with random 
matrix theory, since the sample covariance is a particular type of random matrix. A classical 
framework allows the sample size n to tend to infinity while the matrix dimension d remains 
fixed; in such a setting, the behavior of the sample covariance matrix is characterized by 
the usual limit theory. By contrast, for high-dimensional random matrices in which the data
dimension is either comparable to the sample size ($d \asymp n$), or possibly much larger than the
sample size ($d \gg n$), many new phenomena arise.

High-dimensional random matrices play an important role in many branches of science, 
mathematics and engineering, and have been studied extensively. Part of high-dimensional 
theory is asymptotic in nature, such as the Wigner semicircle law and the Marčenko–Pastur
law for the asymptotic distribution of the eigenvalues of a sample covariance matrix (see
Chapter 1 for illustration of the latter). By contrast, this chapter is devoted to an
exploration of random matrices in a non-asymptotic setting, with the goal of obtaining explicit
deviation inequalities that hold for all sample sizes and matrix dimensions. Beginning with
the simplest case, namely ensembles of Gaussian random matrices, we then discuss more
general sub-Gaussian ensembles, and then move onwards to ensembles with milder tail
conditions. Throughout our development, we bring to bear the techniques from concentration
of measure, comparison inequalities and metric entropy developed previously in Chapters 2
through 5. In addition, this chapter introduces some new techniques, among them a class of
matrix tail bounds developed over the past decade (see Section 6.4).


6.1 Some preliminaries 


We begin by introducing notation and preliminary results used throughout this chapter, be- 
fore setting up the problem of covariance estimation more precisely. 


6.1.1 Notation and basic facts 
Given a rectangular matrix $A \in \mathbb{R}^{n \times m}$ with $n \ge m$, we write its ordered singular values as

$$\sigma_{\max}(A) = \sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_m(A) = \sigma_{\min}(A) \ge 0.$$

Note that the minimum and maximum singular values have the variational characterization

$$\sigma_{\max}(A) = \max_{v \in S^{m-1}} \|Av\|_2 \quad \text{and} \quad \sigma_{\min}(A) = \min_{v \in S^{m-1}} \|Av\|_2, \tag{6.1}$$

where $S^{d-1} := \{v \in \mathbb{R}^d \mid \|v\|_2 = 1\}$ is the Euclidean unit sphere in $\mathbb{R}^d$. Note that we have the
equivalence $|||A|||_2 = \sigma_{\max}(A)$.

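Since covariance matrices are symmetric, we also focus on the set of symmetric matrices
in $\mathbb{R}^{d \times d}$, denoted $\mathcal{S}^{d \times d} := \{Q \in \mathbb{R}^{d \times d} \mid Q = Q^T\}$, as well as the subset of positive semidefinite
matrices given by

$$\mathcal{S}_+^{d \times d} := \{Q \in \mathcal{S}^{d \times d} \mid Q \succeq 0\}. \tag{6.2}$$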


From standard linear algebra, we recall the fact that any matrix $Q \in \mathcal{S}^{d \times d}$ is diagonalizable
via a unitary transformation, and we use $\gamma(Q) \in \mathbb{R}^d$ to denote its vector of eigenvalues,
ordered as

$$\gamma_{\max}(Q) = \gamma_1(Q) \ge \gamma_2(Q) \ge \cdots \ge \gamma_d(Q) = \gamma_{\min}(Q).$$

Note that a matrix $Q$ is positive semidefinite, written $Q \succeq 0$ for short, if and only if
$\gamma_{\min}(Q) \ge 0$.
Our analysis frequently exploits the Rayleigh–Ritz variational characterization of the
minimum and maximum eigenvalues, namely

$$\gamma_{\max}(Q) = \max_{v \in S^{d-1}} v^T Q v \quad \text{and} \quad \gamma_{\min}(Q) = \min_{v \in S^{d-1}} v^T Q v. \tag{6.3}$$

For any symmetric matrix $Q$, the $\ell_2$-operator norm can be written as

$$|||Q|||_2 = \max\{\gamma_{\max}(Q), |\gamma_{\min}(Q)|\}, \tag{6.4a}$$

by virtue of which it inherits the variational representation

$$|||Q|||_2 = \max_{v \in S^{d-1}} |v^T Q v|. \tag{6.4b}$$


Finally, given a rectangular matrix $A \in \mathbb{R}^{n \times m}$ with $n \ge m$, suppose that we define the
$m$-dimensional symmetric matrix $R := A^T A$. We then have the relationship

$$\gamma_j(R) = (\sigma_j(A))^2 \quad \text{for } j = 1, \ldots, m.$$
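The facts collected in this subsection are easy to confirm numerically. The following is a small illustrative sketch, assuming NumPy; the helper names `gram_spectrum` and `stretch` are not from the text. It compares the eigenvalues of $R = A^T A$ with the squared singular values of $A$, and evaluates $\|Av\|_2$ on unit vectors, which by the variational characterization (6.1) must lie between $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$.

```python
import numpy as np


def gram_spectrum(A):
    """Singular values of A and eigenvalues of R = A^T A, both in descending order."""
    sigma = np.linalg.svd(A, compute_uv=False)   # returned in descending order
    gamma = np.linalg.eigvalsh(A.T @ A)[::-1]    # eigvalsh is ascending, so reverse
    return sigma, gamma


def stretch(A, v):
    """||A v||_2 after normalising v to unit Euclidean norm."""
    v = v / np.linalg.norm(v)
    return float(np.linalg.norm(A @ v))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 5))              # n = 8 >= m = 5
    sigma, gamma = gram_spectrum(A)
    print("gamma_j(R) equals sigma_j(A)^2:", np.allclose(gamma, sigma**2))
    print("operator norm equals sigma_max:", np.isclose(np.linalg.norm(A, 2), sigma[0]))
```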


6.1.2 Set-up of covariance estimation 


Let us now define the problem of covariance matrix estimation. Let $\{x_1, \ldots, x_n\}$ be a
collection of $n$ independent and identically distributed samples¹ from a distribution in $\mathbb{R}^d$ with zero
mean, and covariance matrix $\Sigma = \mathrm{cov}(x_1) \in \mathcal{S}_+^{d \times d}$. A standard estimator of $\Sigma$ is the sample
covariance matrix

$$\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^n x_i x_i^T. \tag{6.5}$$

Since each $x_i$ has zero mean, we are guaranteed that $\mathbb{E}[x_i x_i^T] = \Sigma$, and hence that the random
matrix $\widehat{\Sigma}$ is an unbiased estimator of the population covariance $\Sigma$. Consequently, the error
matrix $\widehat{\Sigma} - \Sigma$ has mean zero, and our goal in this chapter is to obtain bounds on the error
measured in the $\ell_2$-operator norm. By the variational representation (6.4b), a bound of the
form $|||\widehat{\Sigma} - \Sigma|||_2 \le \epsilon$ is equivalent to asserting that

$$\max_{v \in S^{d-1}} \Big| \frac{1}{n} \sum_{i=1}^n \langle x_i, v \rangle^2 - v^T \Sigma v \Big| \le \epsilon. \tag{6.6}$$

This representation shows that controlling the deviation $|||\widehat{\Sigma} - \Sigma|||_2$ is equivalent to establishing
a uniform law of large numbers for the class of functions $x \mapsto \langle x, v \rangle^2$, indexed by vectors
$v \in S^{d-1}$. See Chapter 4 for further discussion of such uniform laws in a general setting.

¹ In this chapter, we use a lower case x to denote a random vector, so as to distinguish it from a random matrix.

Control in the operator norm also guarantees that the eigenvalues of $\widehat{\Sigma}$ are uniformly close
to those of $\Sigma$. In particular, by a corollary of Weyl's theorem (see the bibliographic section
for details), we have

$$\max_{j = 1, \ldots, d} |\gamma_j(\widehat{\Sigma}) - \gamma_j(\Sigma)| \le |||\widehat{\Sigma} - \Sigma|||_2. \tag{6.7}$$


A similar type of guarantee can be made for the eigenvectors of the two matrices, but only if 
one has additional control on the separation between adjacent eigenvalues. See our discus- 
sion of principal component analysis in Chapter 8 for more details. 
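The eigenvalue guarantee (6.7) can be checked on simulated data. The sketch below is illustrative only: it assumes NumPy, restricts to a diagonal population covariance so that Gaussian rows are easy to generate, and the name `weyl_check` is a hypothetical helper rather than anything defined in the text.

```python
import numpy as np


def weyl_check(sigma_diag, n, rng):
    """Draw n rows from N(0, diag(sigma_diag)) and return
    (max_j |gamma_j(Sigma_hat) - gamma_j(Sigma)|, |||Sigma_hat - Sigma|||_2)."""
    sigma_diag = np.asarray(sigma_diag, dtype=float)
    d = sigma_diag.size
    Sigma = np.diag(sigma_diag)
    # Scaling column j of a standard Gaussian matrix by sqrt(Sigma_jj) gives rows ~ N(0, Sigma).
    X = rng.standard_normal((n, d)) * np.sqrt(sigma_diag)
    Sigma_hat = X.T @ X / n
    # eigvalsh sorts both spectra in ascending order, so entries are matched by rank.
    eig_gap = np.max(np.abs(np.linalg.eigvalsh(Sigma_hat) - np.linalg.eigvalsh(Sigma)))
    op_err = np.linalg.norm(Sigma_hat - Sigma, ord=2)
    return float(eig_gap), float(op_err)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gap, err = weyl_check([4.0, 2.0, 1.0, 0.5], n=200, rng=rng)
    print(f"eigenvalue gap = {gap:.4f}, operator-norm error = {err:.4f}")
```

The first returned value never exceeds the second (up to rounding), whatever the sample size; the operator-norm error itself shrinks only as $n$ grows.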

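Finally, we point out the connection to the singular values of the random matrix $X \in \mathbb{R}^{n \times d}$,
denoted by $\{\sigma_j(X)\}_{j=1}^{\min\{n, d\}}$. Since the matrix $X$ has the vector $x_i^T$ as its $i$th row, we have

$$\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T = \frac{1}{n} X^T X,$$

and hence it follows that the eigenvalues of $\widehat{\Sigma}$ are the squares of the singular values of $X/\sqrt{n}$.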


6.2 Wishart matrices and their behavior 


We begin by studying the behavior of singular values for random matrices with Gaussian
rows. More precisely, let us suppose that each sample $x_i$ is drawn i.i.d. from a multivariate
$N(0, \Sigma)$ distribution, in which case we say that the associated matrix $X \in \mathbb{R}^{n \times d}$, with $x_i^T$
as its $i$th row, is drawn from the $\Sigma$-Gaussian ensemble. The associated sample covariance
$\widehat{\Sigma} = \frac{1}{n} X^T X$ is said to follow a multivariate Wishart distribution.

Theorem 6.1 Let $X \in \mathbb{R}^{n \times d}$ be drawn according to the $\Sigma$-Gaussian ensemble. Then for
all $\delta > 0$, the maximum singular value $\sigma_{\max}(X)$ satisfies the upper deviation inequality

$$\mathbb{P}\Big[\frac{\sigma_{\max}(X)}{\sqrt{n}} \ge \gamma_{\max}(\sqrt{\Sigma})(1 + \delta) + \sqrt{\frac{\mathrm{tr}(\Sigma)}{n}}\Big] \le e^{-n\delta^2/2}. \tag{6.8}$$

Moreover, for $n \ge d$, the minimum singular value $\sigma_{\min}(X)$ satisfies the analogous lower
deviation inequality

$$\mathbb{P}\Big[\frac{\sigma_{\min}(X)}{\sqrt{n}} \le \gamma_{\min}(\sqrt{\Sigma})(1 - \delta) - \sqrt{\frac{\mathrm{tr}(\Sigma)}{n}}\Big] \le e^{-n\delta^2/2}. \tag{6.9}$$

Before proving this result, let us consider some illustrative examples. 


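Example 6.2 (Operator norm bounds for the standard Gaussian ensemble) Consider a
random matrix $W \in \mathbb{R}^{n \times d}$ generated with i.i.d. $N(0, 1)$ entries. This choice yields an instance of
the $\Sigma$-Gaussian ensemble, in particular with $\Sigma = I_d$. By specializing Theorem 6.1, we conclude
that for $n \ge d$, we have

$$\frac{\sigma_{\max}(W)}{\sqrt{n}} \le 1 + \delta + \sqrt{\frac{d}{n}} \quad \text{and} \quad \frac{\sigma_{\min}(W)}{\sqrt{n}} \ge 1 - \delta - \sqrt{\frac{d}{n}}, \tag{6.10}$$

where both bounds hold with probability greater than $1 - 2e^{-n\delta^2/2}$. These bounds on the
singular values of $W$ imply that

$$\Big|\Big|\Big| \frac{1}{n} W^T W - I_d \Big|\Big|\Big|_2 \le 2\epsilon + \epsilon^2, \quad \text{where } \epsilon = \sqrt{\frac{d}{n}} + \delta, \tag{6.11}$$

with the same probability. Consequently, the sample covariance $\widehat{\Sigma} = \frac{1}{n} W^T W$ is a consistent
estimate of the identity matrix $I_d$ whenever $d/n \to 0$. ♣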
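As a sanity check on Example 6.2, one can draw a standard Gaussian matrix and compare its extreme singular values with the bounds (6.10) and (6.11). The sketch below assumes NumPy; `gaussian_ensemble_summary` is an illustrative name, and the particular values of $n$, $d$ and $\delta$ are arbitrary choices.

```python
import numpy as np


def gaussian_ensemble_summary(n, d, delta, rng):
    """Extreme singular values of W/sqrt(n) for a standard Gaussian W, the quantity
    eps = sqrt(d/n) + delta from (6.11), and the operator-norm error of W^T W / n."""
    W = rng.standard_normal((n, d))
    s = np.linalg.svd(W, compute_uv=False) / np.sqrt(n)   # descending order
    eps = np.sqrt(d / n) + delta
    op_err = np.linalg.norm(W.T @ W / n - np.eye(d), ord=2)
    return {"s_max": float(s[0]), "s_min": float(s[-1]), "eps": float(eps), "op_err": float(op_err)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out = gaussian_ensemble_summary(n=400, d=40, delta=0.5, rng=rng)
    print("sigma_max/sqrt(n) =", round(out["s_max"], 3), " bound:", round(1 + out["eps"], 3))
    print("sigma_min/sqrt(n) =", round(out["s_min"], 3), " bound:", round(1 - out["eps"], 3))
    print("operator error    =", round(out["op_err"], 3), " bound:", round(2 * out["eps"] + out["eps"] ** 2, 3))
```

With $n = 400$ and $\delta = 0.5$, the stated failure probability $2e^{-n\delta^2/2}$ is astronomically small, so the bounds should hold on essentially every draw; they are typically far from tight at this value of $\delta$.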


The preceding example has interesting consequences for the problem of sparse linear 
regression using standard Gaussian random matrices, as in compressed sensing; in particular, 
see our discussion of the restricted isometry property in Chapter 7. On the other hand, from 
the perspective of covariance estimation, estimating the identity matrix is not especially 
interesting. However, a minor modification does lead to a more realistic family of problems. 


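Example 6.3 (Gaussian covariance estimation) Let $X \in \mathbb{R}^{n \times d}$ be a random matrix from
the $\Sigma$-Gaussian ensemble. By standard properties of the multivariate Gaussian, we can write
$X = W\sqrt{\Sigma}$, where $W \in \mathbb{R}^{n \times d}$ is a standard Gaussian random matrix, and hence

$$\Big|\Big|\Big| \frac{1}{n} X^T X - \Sigma \Big|\Big|\Big|_2 = \Big|\Big|\Big| \sqrt{\Sigma} \Big\{ \frac{1}{n} W^T W - I_d \Big\} \sqrt{\Sigma} \Big|\Big|\Big|_2 \le |||\Sigma|||_2 \, \Big|\Big|\Big| \frac{1}{n} W^T W - I_d \Big|\Big|\Big|_2.$$

Consequently, by exploiting the bound (6.11), we are guaranteed that, for all $\delta > 0$,

$$\frac{|||\widehat{\Sigma} - \Sigma|||_2}{|||\Sigma|||_2} \le 2\sqrt{\frac{d}{n}} + 2\delta + \Big(\sqrt{\frac{d}{n}} + \delta\Big)^2, \tag{6.12}$$

with probability at least $1 - 2e^{-n\delta^2/2}$. Overall, we conclude that the relative error
$|||\widehat{\Sigma} - \Sigma|||_2 / |||\Sigma|||_2$ converges to zero as long as the ratio $d/n$ converges to zero. ♣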
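The construction $X = W\sqrt{\Sigma}$ used in Example 6.3 translates directly into code. The following sketch assumes NumPy; `relative_error` and `bound_6_12` are illustrative helper names, and the symmetric square root is computed through an eigendecomposition, which is one choice among several.

```python
import numpy as np


def relative_error(Sigma, n, rng):
    """|||Sigma_hat - Sigma|||_2 / |||Sigma|||_2 for n Gaussian rows with covariance Sigma."""
    d = Sigma.shape[0]
    evals, evecs = np.linalg.eigh(Sigma)
    # Symmetric square root of Sigma; clip guards against tiny negative rounding errors.
    sqrt_Sigma = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    W = rng.standard_normal((n, d))
    X = W @ sqrt_Sigma                      # rows of X are N(0, Sigma)
    Sigma_hat = X.T @ X / n
    return float(np.linalg.norm(Sigma_hat - Sigma, 2) / np.linalg.norm(Sigma, 2))


def bound_6_12(d, n, delta):
    """Right-hand side of the bound (6.12)."""
    r = np.sqrt(d / n)
    return float(2 * r + 2 * delta + (r + delta) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((20, 20))
    Sigma = B @ B.T / 20 + np.eye(20)       # an arbitrary positive definite covariance
    for n in (100, 1000, 10000):
        print(n, round(relative_error(Sigma, n, rng), 4), round(bound_6_12(20, n, 0.1), 4))
```

The printed relative errors should decrease as $d/n$ shrinks, in line with the conclusion of the example.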


It is interesting to consider Theorem 6.1 in application to sequences of matrices that
satisfy additional structure, one being control on the eigenvalues of the covariance matrix $\Sigma$.

Example 6.4 (Faster rates under trace constraints) Recall that $\{\gamma_j(\Sigma)\}_{j=1}^d$ denotes the
ordered sequence of eigenvalues of the matrix $\Sigma$, with $\gamma_1(\Sigma)$ being the maximum eigenvalue.
Now consider a non-zero covariance matrix $\Sigma$ that satisfies a “trace constraint” of the form

$$\frac{\mathrm{tr}(\Sigma)}{|||\Sigma|||_2} = \frac{\sum_{j=1}^d \gamma_j(\Sigma)}{\gamma_1(\Sigma)} \le C, \tag{6.13}$$

where $C$ is some constant independent of dimension. Note that this ratio is a rough measure
of the matrix rank, since inequality (6.13) always holds with $C = \mathrm{rank}(\Sigma)$. Perhaps more
interesting are matrices that are full-rank but that exhibit a relatively fast eigendecay, with
a canonical instance being matrices that belong to the Schatten $q$-“balls” of matrices. For
symmetric matrices, these sets take the form


$$\mathbb{B}_q(R_q) := \Big\{ \Sigma \in \mathcal{S}^{d \times d} \ \Big|\ \sum_{j=1}^d |\gamma_j(\Sigma)|^q \le R_q \Big\}, \tag{6.14}$$

where $q \in [0, 1]$ is a given parameter, and $R_q > 0$ is the radius. If we restrict to matrices
with eigenvalues in $[-1, 1]$, these matrix families are nested: the smallest set with $q = 0$
corresponds to the case of matrices with rank at most $R_0$, whereas the other extreme $q = 1$
corresponds to an explicit trace constraint. Note that any non-zero matrix $\Sigma \in \mathbb{B}_q(R_q)$
satisfies a bound of the form (6.13) with the parameter $C = R_q / (\gamma_1(\Sigma))^q$.

For any matrix class satisfying the bound (6.13), Theorem 6.1 guarantees that, with high
probability, the maximum singular value is bounded above as

$$\frac{\sigma_{\max}(X)}{\sqrt{n}} \le \gamma_{\max}(\sqrt{\Sigma}) \Big(1 + \delta + \sqrt{\frac{C}{n}}\Big). \tag{6.15}$$

By comparison to the earlier bound (6.10) for $\Sigma = I_d$, we conclude that the parameter $C$
plays the role of the effective dimension. ♣
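To see the effective dimension at work, consider a diagonal covariance with polynomially decaying eigenvalues $\gamma_j = j^{-2}$, for which the ratio in (6.13) stays below $\pi^2/6$ regardless of $d$. The sketch below assumes NumPy; the helper names and the specific eigenvalue sequence are illustrative choices, not taken from the text.

```python
import numpy as np


def effective_dimension(eigs):
    """tr(Sigma) / |||Sigma|||_2 for a PSD matrix with the given eigenvalues."""
    eigs = np.asarray(eigs, dtype=float)
    return float(eigs.sum() / eigs.max())


def scaled_sigma_max(eigs, n, rng):
    """sigma_max(X)/sqrt(n) for X = W sqrt(Sigma) with Sigma = diag(eigs)."""
    eigs = np.asarray(eigs, dtype=float)
    W = rng.standard_normal((n, eigs.size))
    X = W * np.sqrt(eigs)                  # column j scaled by sqrt(gamma_j)
    return float(np.linalg.norm(X, 2) / np.sqrt(n))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, delta = 300, 100, 0.5
    eigs = 1.0 / np.arange(1, d + 1) ** 2
    C = effective_dimension(eigs)
    bound = np.sqrt(eigs.max()) * (1 + delta + np.sqrt(C / n))   # right-hand side of (6.15)
    print(f"d = {d}, effective dimension C = {C:.3f}")
    print(f"sigma_max/sqrt(n) = {scaled_sigma_max(eigs, n, rng):.3f}, bound (6.15) = {bound:.3f}")
```

Even though $d > n$ here, the bound depends on $C \approx 1.64$ rather than on $d = 300$, which is the point of the example.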


We now turn to the proof of Theorem 6.1. 


Proof In order to simplify notation in the proof, let us introduce the convenient shorthand
$\bar{\sigma}_{\max} = \gamma_{\max}(\sqrt{\Sigma})$ and $\bar{\sigma}_{\min} = \gamma_{\min}(\sqrt{\Sigma})$. Our proofs of both the upper and lower bounds
consist of two steps: first, we use concentration inequalities (see Chapter 2) to argue that the
random singular value is close to its expectation with high probability, and second, we use
Gaussian comparison inequalities (see Chapter 5) to bound the expected values.


Maximum singular value: As noted previously, by standard properties of the multivariate
Gaussian distribution, we can write $X = W\sqrt{\Sigma}$, where the random matrix $W \in \mathbb{R}^{n \times d}$ has
i.i.d. $N(0, 1)$ entries. Now let us view the mapping $W \mapsto \frac{\sigma_{\max}(W\sqrt{\Sigma})}{\sqrt{n}}$ as a real-valued function
on $\mathbb{R}^{nd}$. By the argument given in Example 2.32, this function is Lipschitz with respect to
the Euclidean norm with constant at most $L = \bar{\sigma}_{\max}/\sqrt{n}$. By concentration of measure for
Lipschitz functions of Gaussian random vectors (Theorem 2.26), we conclude that

$$\mathbb{P}\big[\sigma_{\max}(X) \ge \mathbb{E}[\sigma_{\max}(X)] + \sqrt{n}\, \bar{\sigma}_{\max} \delta\big] \le e^{-n\delta^2/2}.$$

Consequently, it suffices to show that

$$\mathbb{E}[\sigma_{\max}(X)] \le \sqrt{n}\, \bar{\sigma}_{\max} + \sqrt{\mathrm{tr}(\Sigma)}. \tag{6.16}$$


In order to do so, we first write $\sigma_{\max}(X)$ in a variational fashion, as the maximum of a
suitably defined Gaussian process. By definition of the maximum singular value, we have
$\sigma_{\max}(X) = \max_{v' \in S^{d-1}} \|Xv'\|_2$, where $S^{d-1}$ denotes the Euclidean unit sphere in $\mathbb{R}^d$. Recalling
the representation $X = W\sqrt{\Sigma}$ and making the substitution $v = \sqrt{\Sigma}\, v'$, we can write

$$\sigma_{\max}(X) = \max_{v \in S^{d-1}(\Sigma^{-1})} \|Wv\|_2 = \max_{u \in S^{n-1}} \ \max_{v \in S^{d-1}(\Sigma^{-1})} u^T W v,$$

where $S^{d-1}(\Sigma^{-1}) := \{v \in \mathbb{R}^d \mid \|\Sigma^{-1/2} v\|_2 = 1\}$ is an ellipse. Consequently, obtaining bounds
on the maximum singular value corresponds to controlling the supremum of the zero-mean
Gaussian process $\{Z_{u,v} := u^T W v, \ (u, v) \in T\}$ indexed by the set $T := S^{n-1} \times S^{d-1}(\Sigma^{-1})$.

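We upper bound the expected value of this supremum by constructing another Gaussian
process $\{Y_{u,v}, (u, v) \in T\}$ such that $\mathbb{E}[(Z_{u,v} - Z_{\tilde{u},\tilde{v}})^2] \le \mathbb{E}[(Y_{u,v} - Y_{\tilde{u},\tilde{v}})^2]$ for all pairs $(u, v)$
and $(\tilde{u}, \tilde{v})$ in $T$. We can then apply the Sudakov–Fernique comparison (Theorem 5.27) to
conclude that

$$\mathbb{E}[\sigma_{\max}(X)] = \mathbb{E}\Big[\max_{(u,v) \in T} Z_{u,v}\Big] \le \mathbb{E}\Big[\max_{(u,v) \in T} Y_{u,v}\Big]. \tag{6.17}$$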


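Introducing the Gaussian process $Z_{u,v} := u^T W v$, let us first compute the induced
pseudo-metric $\rho_Z$. Given two pairs $(u, v)$ and $(\tilde{u}, \tilde{v})$, we may assume without loss of generality that
$\|v\|_2 \le \|\tilde{v}\|_2$. (If not, we simply reverse the roles of $(u, v)$ and $(\tilde{u}, \tilde{v})$ in the argument to follow.)
We begin by observing that $Z_{u,v} = \langle\langle W, uv^T \rangle\rangle$, where we use $\langle\langle A, B \rangle\rangle := \sum_{j=1}^n \sum_{k=1}^d A_{jk} B_{jk}$ to
denote the trace inner product. Since the matrix $W$ has i.i.d. $N(0, 1)$ entries, we have

$$\mathbb{E}[(Z_{u,v} - Z_{\tilde{u},\tilde{v}})^2] = \mathbb{E}\big[\langle\langle W, uv^T - \tilde{u}\tilde{v}^T \rangle\rangle^2\big] = |||uv^T - \tilde{u}\tilde{v}^T|||_F^2.$$

Rearranging and expanding out this Frobenius norm, we find that

$$
\begin{aligned}
|||uv^T - \tilde{u}\tilde{v}^T|||_F^2 &= |||u(v - \tilde{v})^T + (u - \tilde{u})\tilde{v}^T|||_F^2 \\
&= |||(u - \tilde{u})\tilde{v}^T|||_F^2 + |||u(v - \tilde{v})^T|||_F^2 + 2\langle\langle u(v - \tilde{v})^T, (u - \tilde{u})\tilde{v}^T \rangle\rangle \\
&\le \|\tilde{v}\|_2^2 \|u - \tilde{u}\|_2^2 + \|u\|_2^2 \|v - \tilde{v}\|_2^2 + 2\big(\|u\|_2^2 - \langle u, \tilde{u} \rangle\big)\big(\langle v, \tilde{v} \rangle - \|\tilde{v}\|_2^2\big).
\end{aligned}
$$

Now since $\|u\|_2 = \|\tilde{u}\|_2 = 1$ by definition of the set $T$, we have $\|u\|_2^2 - \langle u, \tilde{u} \rangle \ge 0$. On the other
hand, we have

$$|\langle v, \tilde{v} \rangle| \overset{(i)}{\le} \|v\|_2 \|\tilde{v}\|_2 \overset{(ii)}{\le} \|\tilde{v}\|_2^2,$$

where step (i) follows from the Cauchy–Schwarz inequality, and step (ii) follows from our
initial assumption that $\|v\|_2 \le \|\tilde{v}\|_2$. Combined with our previous bound on $\|u\|_2^2 - \langle u, \tilde{u} \rangle$, we
conclude that

$$\underbrace{\big(\|u\|_2^2 - \langle u, \tilde{u} \rangle\big)}_{\ge 0} \ \underbrace{\big(\langle v, \tilde{v} \rangle - \|\tilde{v}\|_2^2\big)}_{\le 0} \le 0.$$

Putting together the pieces, we conclude that

$$|||uv^T - \tilde{u}\tilde{v}^T|||_F^2 \le \|\tilde{v}\|_2^2 \|u - \tilde{u}\|_2^2 + \|v - \tilde{v}\|_2^2.$$

Finally, by definition of the set $S^{d-1}(\Sigma^{-1})$, we have $\|\tilde{v}\|_2 \le \bar{\sigma}_{\max} = \gamma_{\max}(\sqrt{\Sigma})$, and hence

$$\mathbb{E}[(Z_{u,v} - Z_{\tilde{u},\tilde{v}})^2] \le \bar{\sigma}_{\max}^2 \|u - \tilde{u}\|_2^2 + \|v - \tilde{v}\|_2^2.$$

Motivated by this inequality, we define the Gaussian process $Y_{u,v} := \bar{\sigma}_{\max} \langle g, u \rangle + \langle h, v \rangle$,
where $g \in \mathbb{R}^n$ and $h \in \mathbb{R}^d$ are both standard Gaussian random vectors (i.e., with i.i.d. $N(0, 1)$
entries), and mutually independent. By construction, we have

$$\mathbb{E}[(Y_{u,v} - Y_{\tilde{u},\tilde{v}})^2] = \bar{\sigma}_{\max}^2 \|u - \tilde{u}\|_2^2 + \|v - \tilde{v}\|_2^2.$$

Thus, we may apply the Sudakov–Fernique bound (6.17) to conclude that

$$
\begin{aligned}
\mathbb{E}[\sigma_{\max}(X)] &\le \mathbb{E}\Big[\sup_{(u,v) \in T} Y_{u,v}\Big] \\
&= \bar{\sigma}_{\max}\, \mathbb{E}\Big[\sup_{u \in S^{n-1}} \langle g, u \rangle\Big] + \mathbb{E}\Big[\sup_{v \in S^{d-1}(\Sigma^{-1})} \langle h, v \rangle\Big] \\
&= \bar{\sigma}_{\max}\, \mathbb{E}[\|g\|_2] + \mathbb{E}[\|\sqrt{\Sigma}\, h\|_2].
\end{aligned}
$$

By Jensen's inequality, we have $\mathbb{E}[\|g\|_2] \le \sqrt{n}$, and similarly,

$$\mathbb{E}[\|\sqrt{\Sigma}\, h\|_2] \le \sqrt{\mathbb{E}[h^T \Sigma h]} = \sqrt{\mathrm{tr}(\Sigma)},$$

which establishes the claim (6.16).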


The lower bound on the minimum singular value is based on a similar argument, but 
requires somewhat more technical work, so that we defer it to the Appendix (Section 6.6). 


6.3 Covariance matrices from sub-Gaussian ensembles 


Various aspects of our development thus far have crucially exploited different properties 
of the Gaussian distribution, especially our use of the Gaussian comparison inequalities. 
In this section, we show how a somewhat different approach—namely, discretization and 
tail bounds—can be used to establish analogous bounds for general sub-Gaussian random 
matrices, albeit with poorer control of the constants. 

In particular, let us assume that the random vector $x_i \in \mathbb{R}^d$ is zero-mean, and sub-Gaussian
with parameter at most $\sigma$, by which we mean that, for each fixed $v \in S^{d-1}$,

$$\mathbb{E}[e^{\lambda \langle v, x_i \rangle}] \le e^{\frac{\lambda^2 \sigma^2}{2}} \quad \text{for all } \lambda \in \mathbb{R}. \tag{6.18}$$

Equivalently stated, we assume that the scalar random variable $\langle v, x_i \rangle$ is zero-mean and
sub-Gaussian with parameter at most $\sigma$. (See Chapter 2 for an in-depth discussion of
sub-Gaussian variables.) Let us consider some examples to illustrate:


(a) Suppose that the matrix $X \in \mathbb{R}^{n \times d}$ has i.i.d. entries, where each entry $x_{ij}$ is zero-mean
and sub-Gaussian with parameter $\sigma = 1$. Examples include the standard Gaussian
ensemble ($x_{ij} \sim N(0, 1)$), the Rademacher ensemble ($x_{ij} \in \{-1, +1\}$ equiprobably), and,
more generally, any zero-mean distribution supported on the interval $[-1, +1]$. In all of
these cases, for any vector $v \in S^{d-1}$, the random variable $\langle v, x_i \rangle$ is sub-Gaussian with
parameter at most $\sigma$, using the i.i.d. assumption on the entries of $x_i \in \mathbb{R}^d$, and standard
properties of sub-Gaussian variables.
(b) Now suppose that $x_i \sim N(0, \Sigma)$. For any $v \in S^{d-1}$, we have $\langle v, x_i \rangle \sim N(0, v^T \Sigma v)$. Since
$v^T \Sigma v \le |||\Sigma|||_2$, we conclude that $x_i$ is sub-Gaussian with parameter at most $\sigma^2 = |||\Sigma|||_2$.

When the random matrix $X \in \mathbb{R}^{n \times d}$ is formed by drawing each row $x_i \in \mathbb{R}^d$ in an i.i.d. manner
from a $\sigma$-sub-Gaussian distribution, then we say that $X$ is a sample from a row-wise
$\sigma$-sub-Gaussian ensemble. For any such random matrix, we have the following result:


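Theorem 6.5 There are universal constants $\{c_j\}_{j=0}^3$ such that, for any row-wise
$\sigma$-sub-Gaussian random matrix $X \in \mathbb{R}^{n \times d}$, the sample covariance $\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$ satisfies the
bounds

$$\mathbb{E}\big[e^{\lambda |||\widehat{\Sigma} - \Sigma|||_2}\big] \le e^{c_0 \frac{\lambda^2 \sigma^4}{n} + 4d} \quad \text{for all } |\lambda| < \frac{n}{64 e^2 \sigma^2}, \tag{6.19a}$$

and hence

$$\mathbb{P}\Big[\frac{|||\widehat{\Sigma} - \Sigma|||_2}{\sigma^2} \ge c_1 \Big\{\sqrt{\frac{d}{n}} + \frac{d}{n}\Big\} + \delta\Big] \le c_2 e^{-c_3 n \min\{\delta, \delta^2\}} \quad \text{for all } \delta \ge 0. \tag{6.19b}$$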


Remarks: Given the bound (6.19a) on the moment generating function of the random
variable $|||\widehat{\Sigma} - \Sigma|||_2$, the tail bound (6.19b) is a straightforward consequence of the Chernoff
technique (see Chapter 2). When $\Sigma = I_d$ and each $x_i$ is sub-Gaussian with parameter $\sigma = 1$,
the tail bound (6.19b) implies that

$$|||\widehat{\Sigma} - I_d|||_2 \lesssim \sqrt{\frac{d}{n}} + \frac{d}{n}$$

with high probability. For $n \ge d$, this bound implies that the singular values of $X/\sqrt{n}$ satisfy
the sandwich relation

$$1 - c' \sqrt{\frac{d}{n}} \le \frac{\sigma_{\min}(X)}{\sqrt{n}} \le \frac{\sigma_{\max}(X)}{\sqrt{n}} \le 1 + c' \sqrt{\frac{d}{n}}, \tag{6.20}$$

for some universal constant $c' > 1$. It is worth comparing this result to the earlier bounds
(6.10), applicable to the special case of a standard Gaussian matrix. The bound (6.20) has a
qualitatively similar form, except that the constant $c'$ is larger than one.
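The scaling $\sqrt{d/n} + d/n$ in the remark above can be observed empirically for a non-Gaussian ensemble. The sketch below uses the Rademacher ensemble, for which $\Sigma = I_d$ and $\sigma = 1$. It assumes NumPy, and `rademacher_cov_error` is an illustrative name. Since Theorem 6.5 leaves the constants $c_j$ unspecified, the code only reports the ratio of the observed error to $\sqrt{d/n} + d/n$, without claiming any particular value for it.

```python
import numpy as np


def rademacher_cov_error(n, d, rng):
    """|||Sigma_hat - I_d|||_2 for an n x d matrix with i.i.d. Rademacher entries."""
    X = rng.choice([-1.0, 1.0], size=(n, d))
    return float(np.linalg.norm(X.T @ X / n - np.eye(d), ord=2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 50
    for n in (200, 800, 3200, 12800):
        err = rademacher_cov_error(n, d, rng)
        rate = np.sqrt(d / n) + d / n
        print(f"n={n:6d}  error={err:.4f}  rate={rate:.4f}  ratio={err / rate:.2f}")
```

If the bound has the right form, the printed ratio should remain of constant order as $n$ varies.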


Proof For notational convenience, we introduce the shorthand $Q := \widehat{\Sigma} - \Sigma$. Recall from
Section 6.1 the variational representation $|||Q|||_2 = \max_{v \in S^{d-1}} |\langle v, Qv \rangle|$. We first reduce the
supremum to a finite maximum via a discretization argument (see Chapter 5). Let $\{v^1, \ldots, v^N\}$
be a $\frac{1}{8}$-covering of the sphere $S^{d-1}$ in the Euclidean norm; from Example 5.8, there exists
such a covering with $N \le 17^d$ vectors. Given any $v \in S^{d-1}$, we can write $v = v^j + \Delta$ for some
$v^j$ in the cover, and an error vector $\Delta$ such that $\|\Delta\|_2 \le \frac{1}{8}$, and hence

$$\langle v, Qv \rangle = \langle v^j, Qv^j \rangle + 2\langle \Delta, Qv^j \rangle + \langle \Delta, Q\Delta \rangle.$$

Applying the triangle inequality and the definition of operator norm yields

$$
\begin{aligned}
|\langle v, Qv \rangle| &\le |\langle v^j, Qv^j \rangle| + 2\|\Delta\|_2\, |||Q|||_2\, \|v^j\|_2 + |||Q|||_2\, \|\Delta\|_2^2 \\
&\le |\langle v^j, Qv^j \rangle| + \tfrac{1}{4} |||Q|||_2 + \tfrac{1}{64} |||Q|||_2 \\
&\le |\langle v^j, Qv^j \rangle| + \tfrac{1}{2} |||Q|||_2.
\end{aligned}
$$

Rearranging and then taking the supremum over $v \in S^{d-1}$, and the associated maximum over
$j \in \{1, 2, \ldots, N\}$, we obtain

$$|||Q|||_2 = \max_{v \in S^{d-1}} |\langle v, Qv \rangle| \le 2 \max_{j = 1, \ldots, N} |\langle v^j, Qv^j \rangle|.$$


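Consequently, we have

$$\mathbb{E}\big[e^{\lambda |||Q|||_2}\big] \le \mathbb{E}\Big[\exp\Big(2\lambda \max_{j = 1, \ldots, N} |\langle v^j, Qv^j \rangle|\Big)\Big] \le \sum_{j=1}^N \Big\{ \mathbb{E}\big[e^{2\lambda \langle v^j, Qv^j \rangle}\big] + \mathbb{E}\big[e^{-2\lambda \langle v^j, Qv^j \rangle}\big] \Big\}. \tag{6.21}$$

Next we claim that for any fixed unit vector $u \in S^{d-1}$,

$$\mathbb{E}\big[e^{t \langle u, Qu \rangle}\big] \le e^{512 \frac{t^2}{n} e^4 \sigma^4} \quad \text{for all } |t| < \frac{n}{32 e^2 \sigma^2}. \tag{6.22}$$

We take this bound as given for the moment, and use it to complete the theorem's proof. For
each vector $v^j$ in the covering set, we apply the bound (6.22) twice: once with $t = 2\lambda$ and
once with $t = -2\lambda$. Combining the resulting bounds with inequality (6.21), we find that

$$\mathbb{E}\big[e^{\lambda |||Q|||_2}\big] \le 2N e^{2048 \frac{\lambda^2}{n} e^4 \sigma^4} \le e^{c_0 \frac{\lambda^2 \sigma^4}{n} + 4d},$$

valid for all $|\lambda| < \frac{n}{64 e^2 \sigma^2}$, where the final step uses the fact that $2(17^d) \le e^{4d}$. Having
established the moment generating function bound (6.19a), the tail bound (6.19b) follows as a
consequence of Proposition 2.9.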
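The discretization step in the proof above, namely $|||Q|||_2 \le 2 \max_j |\langle v^j, Qv^j \rangle|$ for a $\frac{1}{8}$-covering, can be illustrated in dimension $d = 2$, where an explicit covering of the circle is easy to write down. The sketch assumes NumPy; the equally spaced construction and the names `circle_cover`, `covering_radius` and `cover_maximum` are illustrative choices and are not the covering used in Example 5.8.

```python
import numpy as np


def circle_cover(num_points):
    """Equally spaced unit vectors on the circle S^1."""
    angles = 2 * np.pi * np.arange(num_points) / num_points
    return np.column_stack([np.cos(angles), np.sin(angles)])


def covering_radius(num_points):
    """Largest Euclidean distance from a point of S^1 to the nearest cover point.
    The worst point sits halfway between neighbours, at angle pi/num_points,
    and a chord subtending angle phi has length 2 sin(phi / 2)."""
    return 2 * np.sin(np.pi / (2 * num_points))


def cover_maximum(Q, cover):
    """max_j |<v^j, Q v^j>| over the rows v^j of `cover`."""
    quad_forms = np.einsum("ij,jk,ik->i", cover, Q, cover)
    return float(np.max(np.abs(quad_forms)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cover = circle_cover(64)                 # 64 <= 17^2 points
    print("covering radius:", covering_radius(64), "(needs to be <= 1/8)")
    M = rng.standard_normal((2, 2))
    Q = (M + M.T) / 2
    print("|||Q|||_2          =", np.linalg.norm(Q, 2))
    print("2 * cover maximum  =", 2 * cover_maximum(Q, cover))
```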


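Proof of the bound (6.22): The only remaining detail is to prove the bound (6.22). By the
definition of $Q$ and the i.i.d. assumption, we have

$$\mathbb{E}\big[e^{t \langle u, Qu \rangle}\big] = \prod_{i=1}^n \mathbb{E}\Big[e^{\frac{t}{n}\{\langle x_i, u \rangle^2 - \langle u, \Sigma u \rangle\}}\Big] = \Big(\mathbb{E}\Big[e^{\frac{t}{n}\{\langle x_1, u \rangle^2 - \langle u, \Sigma u \rangle\}}\Big]\Big)^n. \tag{6.23}$$

Letting $\varepsilon \in \{-1, +1\}$ denote a Rademacher variable, independent of $x_1$, a standard
symmetrization argument (see Proposition 4.11) implies that

$$
\begin{aligned}
\mathbb{E}_{x_1}\Big[e^{\frac{t}{n}\{\langle x_1, u \rangle^2 - \langle u, \Sigma u \rangle\}}\Big] &\le \mathbb{E}_{x_1, \varepsilon}\Big[e^{\frac{2t}{n} \varepsilon \langle x_1, u \rangle^2}\Big] \overset{(i)}{=} \sum_{k=0}^\infty \frac{1}{k!} \Big(\frac{2t}{n}\Big)^k \mathbb{E}\big[\varepsilon^k \langle x_1, u \rangle^{2k}\big] \\
&\overset{(ii)}{=} 1 + \sum_{\ell=1}^\infty \frac{1}{(2\ell)!} \Big(\frac{2t}{n}\Big)^{2\ell} \mathbb{E}\big[\langle x_1, u \rangle^{4\ell}\big],
\end{aligned}
$$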


where step (i) follows by the power-series expansion of the exponential, and step (ii) follows
since $\varepsilon$ and $x_1$ are independent, and all odd moments of the Rademacher term vanish. By
property (III) in Theorem 2.6 on equivalent characterizations of sub-Gaussian variables, we
are guaranteed that


EE for all £ = 1,2,..., 


Efx, uy] < 


s l 


and hence 


co 2€ 
1 /2 48)! 
E, [ei uE] < 1 4 > ‘anil: ) sap Veco" 


TD 16t Mae), 
ł=1 
Fo 


where we have used the fact that (40)! < 2*[(20!]°. As long as f(t) := “e?o? < 4, we can 
write 


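$$1 + \sum_{\ell=1}^\infty [f^2(t)]^\ell \overset{(i)}{=} \frac{1}{1 - f^2(t)} \overset{(ii)}{\le} \exp(2 f^2(t)),$$

where step (i) follows by summing the geometric series, and step (ii) follows because
$\frac{1}{1 - a} \le e^{2a}$ for all $a \in [0, \frac{1}{2}]$. Putting together the pieces and combining with our earlier bound (6.23),
we have shown that $\mathbb{E}[e^{t \langle u, Qu \rangle}] \le e^{2n f^2(t)}$, valid for all $|t| < \frac{n}{32 e^2 \sigma^2}$, which establishes the
claim (6.22).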


6.4 Bounds for general matrices 


The preceding sections were devoted to bounds applicable to sample covariances under 
Gaussian or sub-Gaussian tail conditions. This section is devoted to developing extensions 
to more general tail conditions. In order to do so, it is convenient to introduce some more 
general methodology that applies not only to sample covariance matrices, but also to more 
general random matrices. The main results in this section are Theorems 6.15 and 6.17, which 
are (essentially) matrix-based analogs of our earlier Hoeffding and Bernstein bounds for 
random variables. Before proving these results, we develop some useful matrix-theoretic 
generalizations of ideas from Chapter 2, including various types of tail conditions, as well 
as decompositions for the moment generating function for independent random matrices. 


6.4.1 Background on matrix analysis 


We begin by introducing some additional background on matrix-valued functions. Recall
the class $\mathcal{S}^{d \times d}$ of symmetric $d \times d$ matrices. Any function $f : \mathbb{R} \to \mathbb{R}$ can be extended to a
map from the set $\mathcal{S}^{d \times d}$ to itself in the following way. Given a matrix $Q \in \mathcal{S}^{d \times d}$, consider its
eigendecomposition $Q = U^T \Gamma U$. Here the matrix $U \in \mathbb{R}^{d \times d}$ is a unitary matrix, satisfying
the relation $U^T U = I_d$, whereas $\Gamma := \mathrm{diag}(\gamma(Q))$ is a diagonal matrix specified by the vector
of eigenvalues $\gamma(Q) \in \mathbb{R}^d$. Using this notation, we consider the mapping from $\mathcal{S}^{d \times d}$ to itself
defined via

$$Q \mapsto f(Q) := U^T \mathrm{diag}\big(f(\gamma_1(Q)), \ldots, f(\gamma_d(Q))\big) U.$$

In words, we apply the original function $f$ elementwise to the vector of eigenvalues $\gamma(Q)$,
and then rotate the resulting matrix $\mathrm{diag}(f(\gamma(Q)))$ back to the original coordinate system
defined by the eigenvectors of $Q$. By construction, this extension of $f$ to $\mathcal{S}^{d \times d}$ is unitarily
invariant, meaning that

$$f(V^T Q V) = V^T f(Q) V \quad \text{for all unitary matrices } V \in \mathbb{R}^{d \times d},$$

since it affects only the eigenvalues (but not the eigenvectors) of $Q$. Moreover, the
eigenvalues of $f(Q)$ transform in a simple way, since we have

$$\gamma(f(Q)) = \{f(\gamma_j(Q)), \ j = 1, \ldots, d\}. \tag{6.24}$$

In words, the eigenvalues of the matrix $f(Q)$ are simply the eigenvalues of $Q$ transformed
by $f$, a result often referred to as the spectral mapping property.
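The extension of a scalar function to symmetric matrices is short to implement. The sketch below assumes NumPy; `matrix_function` and `exp_series` are illustrative names. One convention differs from the text: `numpy.linalg.eigh` returns eigenvectors as the columns of a matrix $U$ with $Q = U\, \mathrm{diag}(\gamma)\, U^T$, whereas the text writes $Q = U^T \Gamma U$; the two differ only by a transpose of $U$.

```python
import numpy as np


def matrix_function(Q, f):
    """Apply a scalar function f to a symmetric matrix Q through its eigenvalues."""
    gamma, U = np.linalg.eigh(Q)          # Q = U diag(gamma) U^T, eigenvectors in columns
    return (U * f(gamma)) @ U.T           # U diag(f(gamma)) U^T


def exp_series(Q, terms=40):
    """Truncated power series sum_{k < terms} Q^k / k! for the matrix exponential."""
    total = np.zeros_like(Q)
    term = np.eye(Q.shape[0])             # Q^0 / 0!
    for k in range(terms):
        total = total + term
        term = term @ Q / (k + 1)         # next term Q^{k+1} / (k+1)!
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4))
    Q = (M + M.T) / 4
    expQ = matrix_function(Q, np.exp)
    print("matches power series:", np.allclose(expQ, exp_series(Q)))
    print("spectral mapping    :", np.allclose(np.linalg.eigvalsh(expQ), np.exp(np.linalg.eigvalsh(Q))))
```

The second check relies on the exponential being increasing, so that sorting the eigenvalues before or after applying it gives the same order.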

Two functions that play a central role in our development of matrix tail bounds are the
matrix exponential and the matrix logarithm. As a particular case of our construction, the
matrix exponential has the power-series expansion $e^Q = \sum_{k=0}^\infty \frac{Q^k}{k!}$. By the spectral mapping
property, the eigenvalues of $e^Q$ are positive, so that it is a positive definite matrix for any
choice of $Q$. Parts of our analysis also involve the matrix logarithm; when restricted to the
cone of strictly positive definite matrices, as suffices for our purposes, the matrix logarithm
corresponds to the inverse of the matrix exponential.

A function $f$ on $\mathcal{S}^{d \times d}$ is said to be matrix monotone if $f(Q) \preceq f(R)$ whenever $Q \preceq R$. A
useful property of the logarithm is that it is a matrix monotone function, a result known as
the Löwner–Heinz theorem. By contrast, the exponential is not a matrix monotone function,
showing that matrix monotonicity is more complex than the usual notion of monotonicity.
See Exercise 6.5 for further exploration of these properties.

Finally, a useful fact is the following: if $f : \mathbb{R} \to \mathbb{R}$ is any continuous and non-decreasing
function in the usual sense, then for any pair of symmetric matrices such that $Q \preceq R$, we are
guaranteed that

$$\mathrm{tr}(f(Q)) \le \mathrm{tr}(f(R)). \tag{6.25}$$


See the bibliographic section for further discussion of such trace inequalities. 
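The contrast between the logarithm, the exponential and the trace inequality (6.25) can be seen on a $2 \times 2$ example. The sketch below assumes NumPy. The particular pair $Q = \mathrm{diag}(0, 2)$ and $R = Q + 10^{-3}\, \mathbf{1}\mathbf{1}^T$ is a choice made here for illustration, not taken from the text; since $R - Q$ is a rank-one positive semidefinite matrix, we have $Q \preceq R$.

```python
import numpy as np


def matrix_function(Q, f):
    """Apply a scalar function f to a symmetric matrix Q through its eigenvalues."""
    gamma, U = np.linalg.eigh(Q)
    return (U * f(gamma)) @ U.T


def min_eig(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(M)[0])


def monotonicity_gaps():
    Q = np.diag([0.0, 2.0])
    R = Q + 1e-3 * np.ones((2, 2))          # R - Q is rank one and PSD, so Q <= R
    Qp, Rp = Q + np.eye(2), R + np.eye(2)   # shifted copies that are positive definite
    return {
        # Negative value: exp(R) - exp(Q) is NOT positive semidefinite.
        "exp": min_eig(matrix_function(R, np.exp) - matrix_function(Q, np.exp)),
        # Non-negative value: log(Rp) - log(Qp) is PSD, as the Loewner-Heinz theorem asserts.
        "log": min_eig(matrix_function(Rp, np.log) - matrix_function(Qp, np.log)),
        # Non-negative value: the trace inequality (6.25) with f = exp.
        "trace_exp": float(np.trace(matrix_function(R, np.exp)) - np.trace(matrix_function(Q, np.exp))),
    }


if __name__ == "__main__":
    for name, value in monotonicity_gaps().items():
        print(f"{name:10s} {value:+.3e}")
```

So the exponential fails to preserve the semidefinite ordering for this pair, even though its trace is still ordered correctly.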


6.4.2 Tail conditions for matrices 


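Given a symmetric random matrix $Q \in \mathcal{S}^{d \times d}$, its polynomial moments, assuming that they
exist, are the matrices defined by $\mathbb{E}[Q^j]$. As shown in Exercise 6.6, the variance of $Q$ is a
positive semidefinite matrix given by $\mathrm{var}(Q) := \mathbb{E}[Q^2] - (\mathbb{E}[Q])^2$. The moment generating
function of a random matrix $Q$ is the matrix-valued mapping $\Psi_Q : \mathbb{R} \to \mathcal{S}^{d \times d}$ given by

$$\Psi_Q(\lambda) := \mathbb{E}[e^{\lambda Q}] = \sum_{k=0}^\infty \frac{\lambda^k}{k!} \mathbb{E}[Q^k]. \tag{6.26}$$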
k=0 ` 


Under suitable conditions on $Q$, or equivalently, suitable conditions on the polynomial
moments of $Q$, it is guaranteed to be finite for all $\lambda$ in an interval centered at zero. In parallel
with our discussion in Chapter 2, various tail conditions are based on imposing bounds on
this moment generating function. We begin with the simplest case:

Definition 6.6 A zero-mean symmetric random matrix $Q \in \mathcal{S}^{d \times d}$ is sub-Gaussian with
matrix parameter $V \in \mathcal{S}_+^{d \times d}$ if

$$\Psi_Q(\lambda) \preceq e^{\frac{\lambda^2 V}{2}} \quad \text{for all } \lambda \in \mathbb{R}. \tag{6.27}$$


This definition is best understood by working through some simple examples. 


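Example 6.7 Suppose that $Q = \varepsilon B$ where $\varepsilon \in \{-1, +1\}$ is a Rademacher variable, and
$B \in \mathcal{S}^{d \times d}$ is a fixed matrix. Random matrices of this form frequently arise as the result of
symmetrization arguments, as discussed at more length in the sequel. Note that we have
$\mathbb{E}[Q^{2k+1}] = 0$ and $\mathbb{E}[Q^{2k}] = B^{2k}$ for all $k = 1, 2, \ldots$, and hence

$$\mathbb{E}[e^{\lambda Q}] = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} B^{2k} \preceq \sum_{k=0}^\infty \frac{1}{k!} \Big(\frac{\lambda^2 B^2}{2}\Big)^k = e^{\frac{\lambda^2 B^2}{2}},$$

showing that the sub-Gaussian condition (6.27) holds with $V = B^2 = \mathrm{var}(Q)$. ♣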
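For the Rademacher example, the moment generating function is available in closed form, $\mathbb{E}[e^{\lambda \varepsilon B}] = \frac{1}{2}(e^{\lambda B} + e^{-\lambda B})$, so the semidefinite inequality (6.27) with $V = B^2$ can be checked directly. The sketch below assumes NumPy; the helper names are illustrative, and the test matrix is normalised to unit operator norm only to keep the numbers well scaled.

```python
import numpy as np


def matrix_function(Q, f):
    """Apply a scalar function f to a symmetric matrix Q through its eigenvalues."""
    gamma, U = np.linalg.eigh(Q)
    return (U * f(gamma)) @ U.T


def rademacher_mgf(B, lam):
    """E[exp(lam * eps * B)] for a Rademacher eps: the average over eps = +1 and eps = -1."""
    return 0.5 * (matrix_function(lam * B, np.exp) + matrix_function(-lam * B, np.exp))


def sub_gaussian_gap(B, lam):
    """Smallest eigenvalue of exp(lam^2 B^2 / 2) - E[exp(lam * eps * B)].
    A non-negative value means that (6.27) holds at this lam with V = B^2."""
    upper = matrix_function(0.5 * lam**2 * (B @ B), np.exp)
    return float(np.linalg.eigvalsh(upper - rademacher_mgf(B, lam))[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((3, 3))
    B = (M + M.T) / 2
    B = B / np.linalg.norm(B, 2)
    for lam in (-1.0, -0.3, 0.3, 1.0):
        print(f"lambda={lam:+.1f}  min eigenvalue of gap = {sub_gaussian_gap(B, lam):.3e}")
```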


As we show in Exercise 6.7, more generally, a random matrix of the form $Q = gB$, where
$g \in \mathbb{R}$ is a $\sigma$-sub-Gaussian variable with distribution symmetric around zero, satisfies the
condition (6.27) with matrix parameter $V = \sigma^2 B^2$.


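Example 6.8 As an extension of the previous example, consider a random matrix of the
form $Q = \varepsilon C$, where $\varepsilon$ is a Rademacher variable as before, and $C$ is now a random matrix,
independent of $\varepsilon$ with its spectral norm bounded as $|||C|||_2 \le b$. First fixing $C$ and taking
expectations over the Rademacher variable, the previous example yields $\mathbb{E}_\varepsilon[e^{\lambda \varepsilon C}] \preceq e^{\frac{\lambda^2}{2} C^2}$.
Since $|||C|||_2 \le b$, we have $e^{\frac{\lambda^2}{2} C^2} \preceq e^{\frac{\lambda^2}{2} b^2 I_d}$, and hence

$$\Psi_Q(\lambda) \preceq e^{\frac{\lambda^2}{2} b^2 I_d} \quad \text{for all } \lambda \in \mathbb{R},$$

showing that $Q$ is sub-Gaussian with matrix parameter $V = b^2 I_d$. ♣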


In parallel with our treatment of scalar random variables in Chapter 2, it is natural to con- 
sider various weakenings of the sub-Gaussian requirement. 


Definition 6.9 (Sub-exponential random matrices) A zero-mean random matrix is
sub-exponential with parameters $(V, \alpha)$ if

$$\Psi_Q(\lambda) \preceq e^{\frac{\lambda^2 V}{2}} \quad \text{for all } |\lambda| < \frac{1}{\alpha}. \tag{6.28}$$

Thus, any sub-Gaussian random matrix is also sub-exponential with parameters $(V, 0)$.
However, there also exist sub-exponential random matrices that are not sub-Gaussian. One
example is the zero-mean random matrix $M = \varepsilon g^2 B$, where $\varepsilon \in \{-1, +1\}$ is a Rademacher
variable, the variable $g \sim N(0, 1)$ is independent of $\varepsilon$, and $B$ is a fixed symmetric matrix.


The Bernstein condition for random matrices provides one useful way of certifying the 
sub-exponential condition: 


Definition 6.10 (Bernstein's condition for matrices) A zero-mean symmetric random
matrix $Q$ satisfies a Bernstein condition with parameter $b > 0$ if

$$
\mathbb{E}[Q^j] \preceq \tfrac{1}{2}\, j!\, b^{j-2}\, \mathrm{var}(Q) \quad \text{for } j = 3, 4, \ldots. \tag{6.29}
$$


We note that (a stronger form of) Bernstein's condition holds whenever the matrix $Q$ has
a bounded operator norm, say $|||Q|||_2 \le b$ almost surely. In this case, it can be shown (see
Exercise 6.9) that

$$
\mathbb{E}[Q^j] \preceq b^{j-2}\, \mathrm{var}(Q) \quad \text{for all } j = 3, 4, \ldots. \tag{6.30}
$$


Exercise 6.11 gives an example of a random matrix with unbounded operator norm for which 
Bernstein’s condition holds. 

The following lemma shows how the general Bernstein condition (6.29) implies the sub- 
exponential condition. More generally, the argument given here provides an explicit bound 
on the moment generating function: 


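Lemma 6.11 For any symmetric zero-mean random matrix satisfying the Bernstein
condition (6.29), we have

$$
\Psi_Q(\lambda) \preceq \exp\Big( \frac{\lambda^2\, \mathrm{var}(Q)}{2(1 - b|\lambda|)} \Big) \quad \text{for all } |\lambda| < \frac{1}{b}. \tag{6.31}
$$


Proof Since $\mathbb{E}[Q] = 0$, applying the definition of the matrix exponential for a suitably
small $\lambda \in \mathbb{R}$ yields

$$
\begin{aligned}
\mathbb{E}[e^{\lambda Q}] &= I_d + \frac{\lambda^2\, \mathrm{var}(Q)}{2} + \sum_{j=3}^{\infty} \frac{\lambda^j\, \mathbb{E}[Q^j]}{j!} \\
&\overset{(i)}{\preceq} I_d + \frac{\lambda^2\, \mathrm{var}(Q)}{2} \Big\{ \sum_{j=0}^{\infty} |\lambda|^j b^j \Big\} \\
&\overset{(ii)}{=} I_d + \frac{\lambda^2\, \mathrm{var}(Q)}{2(1 - b|\lambda|)} \\
&\overset{(iii)}{\preceq} \exp\Big( \frac{\lambda^2\, \mathrm{var}(Q)}{2(1 - b|\lambda|)} \Big),
\end{aligned}
$$

where step (i) applies the Bernstein condition, step (ii) is valid for any $|\lambda| < 1/b$, a choice
for which the geometric series is summable, and step (iii) follows from the matrix inequality
$I_d + A \preceq e^{A}$, which is valid for any symmetric matrix $A$. (See Exercise 6.4 for more
discussion of this last property.)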


6.4.3 Matrix Chernoff approach and independent decompositions 


The Chernoff approach to tail bounds, as discussed in Chapter 2, is based on controlling the 
moment generating function of a random variable. In this section, we begin by showing that 
the trace of the matrix moment generating function (6.26) plays a similar role in bounding 
the operator norm of random matrices. 


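Lemma 6.12 (Matrix Chernoff technique) Let $Q$ be a zero-mean symmetric random
matrix whose moment generating function $\Psi_Q$ exists in an open interval $(-a, a)$. Then
for any $\delta > 0$, we have

$$
\mathbb{P}[\gamma_{\max}(Q) \ge \delta] \le \mathrm{tr}(\Psi_Q(\lambda))\, e^{-\lambda \delta} \quad \text{for all } \lambda \in [0, a), \tag{6.32}
$$

where $\mathrm{tr}(\cdot)$ denotes the trace operator on matrices. Similarly, we have

$$
\mathbb{P}[|||Q|||_2 \ge \delta] \le 2\, \mathrm{tr}(\Psi_Q(\lambda))\, e^{-\lambda \delta} \quad \text{for all } \lambda \in [0, a). \tag{6.33}
$$


Proof For each $\lambda \in [0, a)$, we have

$$
\mathbb{P}[\gamma_{\max}(Q) \ge \delta] = \mathbb{P}[e^{\gamma_{\max}(\lambda Q)} \ge e^{\lambda \delta}] \overset{(i)}{=} \mathbb{P}[\gamma_{\max}(e^{\lambda Q}) \ge e^{\lambda \delta}], \tag{6.34}
$$

where step (i) uses the functional calculus relating the eigenvalues of $\lambda Q$ to those of $e^{\lambda Q}$.
Applying Markov's inequality yields

$$
\mathbb{P}[\gamma_{\max}(e^{\lambda Q}) \ge e^{\lambda \delta}] \le \mathbb{E}[\gamma_{\max}(e^{\lambda Q})]\, e^{-\lambda \delta} \overset{(i)}{\le} \mathbb{E}[\mathrm{tr}(e^{\lambda Q})]\, e^{-\lambda \delta}. \tag{6.35}
$$

Here inequality (i) uses the upper bound $\gamma_{\max}(e^{\lambda Q}) \le \mathrm{tr}(e^{\lambda Q})$, which holds since $e^{\lambda Q}$ is
positive definite. Finally, since trace and expectation commute, we have

$$
\mathbb{E}[\mathrm{tr}(e^{\lambda Q})] = \mathrm{tr}(\mathbb{E}[e^{\lambda Q}]) = \mathrm{tr}(\Psi_Q(\lambda)).
$$

Note that the same argument can be applied to bound the event $\gamma_{\max}(-Q) \ge \delta$, or
equivalently the event $\gamma_{\min}(Q) \le -\delta$. Since $|||Q|||_2 = \max\{\gamma_{\max}(Q), |\gamma_{\min}(Q)|\}$, the tail bound on the
operator norm (6.33) follows.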
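For a random matrix with only two possible values, both sides of the bound (6.32) can be computed exactly. The sketch below does so for $Q = \varepsilon B$ with a Rademacher sign, assuming NumPy; the matrix `B`, the threshold `delta`, the grid of $\lambda$ values and the function names are illustrative choices.

```python
import numpy as np

def trace_mgf_rademacher(B, lam):
    """tr E[exp(lam * eps * B)] = sum_j cosh(lam * gamma_j(B)) for a Rademacher sign eps."""
    return np.cosh(lam * np.linalg.eigvalsh(B)).sum()

def exact_upper_tail(B, delta):
    """P[gamma_max(eps * B) >= delta], computed by enumerating eps in {-1, +1}."""
    eig = np.linalg.eigvalsh(B)
    return 0.5 * float(eig.max() >= delta) + 0.5 * float(-eig.min() >= delta)

B = np.array([[2.0, 1.0], [1.0, -1.0]])   # illustrative symmetric matrix
delta = 2.0                               # illustrative threshold
lams = np.linspace(0.0, 5.0, 501)         # grid over lambda >= 0

# Right-hand side of (6.32) on the grid, and its smallest value.
chernoff = np.array([trace_mgf_rademacher(B, lam) * np.exp(-lam * delta) for lam in lams])
best_bound = chernoff.min()
tail = exact_upper_tail(B, delta)
print(tail, best_bound)
```

The exact tail probability never exceeds the optimized Chernoff bound, although the bound can be loose because the trace sums over all eigenvalues rather than only the largest.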


An important property of independent random variables is that the moment generating 
function of their sum can be decomposed as the product of the individual moment gener- 
ating functions. For random matrices, this type of decomposition is no longer guaranteed 
to hold with equality, essentially because matrix products need not commute. However, for 
independent random matrices, it is nonetheless possible to establish an upper bound in terms 
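of the trace of the product of moment generating functions, as we now show.


Lemma 6.13 Let $Q_1, \ldots, Q_n$ be independent symmetric random matrices whose moment
generating functions exist for all $\lambda \in I$, and define the sum $S_n := \sum_{i=1}^n Q_i$. Then

$$
\mathrm{tr}(\Psi_{S_n}(\lambda)) \le \mathrm{tr}\big( e^{\sum_{i=1}^n \log \Psi_{Q_i}(\lambda)} \big) \quad \text{for all } \lambda \in I. \tag{6.36}
$$


Remark: In conjunction with Lemma 6.12, this lemma provides an avenue for obtaining
tail bounds on the operator norm of sums of independent random matrices. In particular, if
we apply the upper bound (6.33) to the random matrix $S_n/n$, we find that

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2\, \mathrm{tr}\big( e^{\sum_{i=1}^n \log \Psi_{Q_i}(\lambda)} \big)\, e^{-\lambda n \delta} \quad \text{for all } \lambda \in [0, a). \tag{6.37}
$$


Proof In order to prove this lemma, we require the following result due to Lieb (1973): for
any fixed matrix $H \in \mathcal{S}^{d \times d}$, the function $f : \mathcal{S}_{+}^{d \times d} \to \mathbb{R}$ given by

$$
f(A) = \mathrm{tr}(e^{H + \log(A)})
$$

is concave. Introducing the shorthand notation $G(\lambda) := \mathrm{tr}(\Psi_{S_n}(\lambda))$, we note that, by linearity
of trace and expectation, we have

$$
G(\lambda) = \mathrm{tr}\big( \mathbb{E}[e^{\lambda S_{n-1} + \log \exp(\lambda Q_n)}] \big) = \mathbb{E}_{S_{n-1}} \mathbb{E}_{Q_n} \big[ \mathrm{tr}(e^{\lambda S_{n-1} + \log \exp(\lambda Q_n)}) \big].
$$

Using concavity of the function $f$ with $H = \lambda S_{n-1}$ and $A = e^{\lambda Q_n}$, Jensen's inequality implies
that

$$
\mathbb{E}_{Q_n} \big[ \mathrm{tr}(e^{\lambda S_{n-1} + \log \exp(\lambda Q_n)}) \big] \le \mathrm{tr}\big( e^{\lambda S_{n-1} + \log \mathbb{E}_{Q_n} \exp(\lambda Q_n)} \big),
$$

so that we have shown that $G(\lambda) \le \mathbb{E}_{S_{n-1}} \big[ \mathrm{tr}(e^{\lambda S_{n-1} + \log \Psi_{Q_n}(\lambda)}) \big]$.
We now recurse this argument, in particular peeling off the term involving $Q_{n-1}$, so that
we have

$$
G(\lambda) \le \mathbb{E}_{S_{n-2}} \mathbb{E}_{Q_{n-1}} \big[ \mathrm{tr}(e^{\lambda S_{n-2} + \log \Psi_{Q_n}(\lambda) + \log \exp(\lambda Q_{n-1})}) \big].
$$

We again exploit the concavity of the function $f$, this time with the choices
$H = \lambda S_{n-2} + \log \Psi_{Q_n}(\lambda)$ and $A = e^{\lambda Q_{n-1}}$, thereby finding that

$$
G(\lambda) \le \mathbb{E}_{S_{n-2}} \big[ \mathrm{tr}(e^{\lambda S_{n-2} + \log \Psi_{Q_{n-1}}(\lambda) + \log \Psi_{Q_n}(\lambda)}) \big].
$$

Continuing in this manner completes the proof of the claim.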
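The inequality (6.36) can be verified exactly in a small case. The following sketch, assuming NumPy, takes $Q_i = \varepsilon_i B_i$ with independent Rademacher signs, so that $\Psi_{Q_i}(\lambda) = \cosh(\lambda B_i)$ and the expectation over $S_n$ is a finite average over sign patterns. The matrices `Bs`, the sizes and the value of `lam` are illustrative.

```python
import itertools
import numpy as np

def apply_to_spectrum(M, f):
    """Apply a scalar function f to a symmetric matrix M via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return (eigvecs * f(eigvals)) @ eigvecs.T

rng = np.random.default_rng(1)
d, n, lam = 3, 3, 0.8                      # illustrative sizes and lambda
Bs = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    Bs.append((G + G.T) / 2)               # fixed symmetric matrices B_1, ..., B_n

# Left side of (6.36): tr E[exp(lam * S_n)], averaged over all 2^n sign patterns.
lhs = np.mean([
    np.trace(apply_to_spectrum(lam * sum(e * B for e, B in zip(signs, Bs)), np.exp))
    for signs in itertools.product([-1.0, 1.0], repeat=n)
])

# Right side: tr exp(sum_i log Psi_{Q_i}(lam)), where log Psi_{Q_i}(lam) = log cosh(lam * B_i).
log_mgf_sum = sum(apply_to_spectrum(lam * B, lambda w: np.log(np.cosh(w))) for B in Bs)
rhs = np.trace(apply_to_spectrum(log_mgf_sum, np.exp))
print(lhs, rhs)
```

When the matrices $B_i$ commute, the two sides coincide; the gap between them reflects the non-commutativity discussed before the lemma.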


In many cases, our goal is to bound the maximum eigenvalue (or operator norm) of sums
of centered random matrices of the form $Q_i = A_i - \mathbb{E}[A_i]$. In this and other settings, it is
often convenient to perform an additional symmetrization step, so that we can deal instead
with matrices $\tilde{Q}_i$ that are guaranteed to have distribution symmetric around zero (meaning
that $\tilde{Q}_i$ and $-\tilde{Q}_i$ follow the same distribution).


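Example 6.14 (Rademacher symmetrization for random matrices) Let $\{A_i\}_{i=1}^n$ be a
sequence of independent symmetric random matrices, and suppose that our goal is to bound
the maximum eigenvalue of the matrix sum $\sum_{i=1}^n (A_i - \mathbb{E}[A_i])$. Since the maximum
eigenvalue can be represented as the supremum of an empirical process, the symmetrization
techniques from Chapter 4 can be used to reduce the problem to one involving the new matrices
$\tilde{Q}_i = \varepsilon_i A_i$, where $\varepsilon_i$ is an independent Rademacher variable. Let us now work through this
reduction. By Markov's inequality, we have

$$
\mathbb{P}\Big[ \gamma_{\max}\Big( \sum_{i=1}^n \{A_i - \mathbb{E}[A_i]\} \Big) \ge \delta \Big] \le \mathbb{E}\Big[ e^{\lambda \gamma_{\max}( \sum_{i=1}^n \{A_i - \mathbb{E}[A_i]\} )} \Big]\, e^{-\lambda \delta}.
$$

By the variational representation of the maximum eigenvalue, we have

$$
\begin{aligned}
\mathbb{E}\Big[ e^{\lambda \gamma_{\max}( \sum_{i=1}^n \{A_i - \mathbb{E}[A_i]\} )} \Big]
&= \mathbb{E}\Big[ \exp\Big( \lambda \sup_{\|u\|_2 = 1} \Big\langle u, \Big( \sum_{i=1}^n (A_i - \mathbb{E}[A_i]) \Big) u \Big\rangle \Big) \Big] \\
&\overset{(i)}{\le} \mathbb{E}\Big[ \exp\Big( 2\lambda \sup_{\|u\|_2 = 1} \Big\langle u, \Big( \sum_{i=1}^n \varepsilon_i A_i \Big) u \Big\rangle \Big) \Big] \\
&= \mathbb{E}\Big[ e^{2\lambda \gamma_{\max}( \sum_{i=1}^n \varepsilon_i A_i )} \Big] \\
&\overset{(ii)}{=} \mathbb{E}\Big[ \gamma_{\max}\big( e^{2\lambda \sum_{i=1}^n \varepsilon_i A_i} \big) \Big],
\end{aligned}
$$

where inequality (i) makes use of the symmetrization inequality from Proposition 4.11(b)
with $\Phi(t) = e^{\lambda t}$, and step (ii) uses the spectral mapping property (6.24). Continuing on, we
have

$$
\mathbb{E}\Big[ \gamma_{\max}\big( e^{2\lambda \sum_{i=1}^n \varepsilon_i A_i} \big) \Big] \le \mathrm{tr}\Big( \mathbb{E}\big[ e^{2\lambda \sum_{i=1}^n \varepsilon_i A_i} \big] \Big) \le \mathrm{tr}\Big( e^{\sum_{i=1}^n \log \Psi_{\tilde{Q}_i}(2\lambda)} \Big),
$$

where the final step follows from applying Lemma 6.13 to the symmetrized matrices
$\tilde{Q}_i = \varepsilon_i A_i$. Consequently, apart from the factor of 2, we may assume without loss of
generality when bounding maximum eigenvalues that our matrices have a distribution symmetric
around zero. ♣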


6.4.4 Upper tail bounds for random matrices 


We now have collected the ingredients necessary for stating and proving various tail bounds 
for the deviations of sums of zero-mean independent random matrices. 


Sub-Gaussian case 


We begin with a tail bound for sub-Gaussian random matrices. It provides an approximate 
analog of the Hoeffding-type tail bound for random variables (Proposition 2.5). 


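Theorem 6.15 (Hoeffding bound for random matrices) Let $\{Q_i\}_{i=1}^n$ be a sequence of
zero-mean independent symmetric random matrices that satisfy the sub-Gaussian
condition with parameters $\{V_i\}_{i=1}^n$. Then for all $\delta > 0$, we have the upper tail bound

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2\, \mathrm{rank}\Big( \sum_{i=1}^n V_i \Big)\, e^{-\frac{n \delta^2}{2 \sigma^2}} \le 2d\, e^{-\frac{n \delta^2}{2 \sigma^2}}, \tag{6.38}
$$

where $\sigma^2 = ||| \frac{1}{n} \sum_{i=1}^n V_i |||_2$.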




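Proof We first prove the claim in the case when $V := \sum_{i=1}^n V_i$ is full-rank, and then show
how to prove the general case. From Lemma 6.13, it suffices to upper bound
$\mathrm{tr}\big( e^{\sum_{i=1}^n \log \Psi_{Q_i}(\lambda)} \big)$.
From Definition 6.6, the assumed sub-Gaussianity, and the monotonicity of the matrix
logarithm, we have

$$
\sum_{i=1}^n \log \Psi_{Q_i}(\lambda) \preceq \frac{\lambda^2}{2} \sum_{i=1}^n V_i.
$$

Now since the exponential is an increasing function, the trace bound (6.25) implies that

$$
\mathrm{tr}\big( e^{\sum_{i=1}^n \log \Psi_{Q_i}(\lambda)} \big) \le \mathrm{tr}\big( e^{\frac{\lambda^2}{2} \sum_{i=1}^n V_i} \big).
$$

This upper bound, when combined with the matrix Chernoff bound (6.37), yields

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2\, \mathrm{tr}\big( e^{\frac{\lambda^2}{2} \sum_{i=1}^n V_i} \big)\, e^{-\lambda n \delta}.
$$

For any $d$-dimensional symmetric matrix $R$, we have $\mathrm{tr}(e^{R}) \le d\, e^{|||R|||_2}$. Applying this
inequality to the matrix $R = \frac{\lambda^2}{2} \sum_{i=1}^n V_i$, for which we have $|||R|||_2 = \frac{\lambda^2}{2} n \sigma^2$, yields the bound

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2d\, e^{\frac{\lambda^2}{2} n \sigma^2 - \lambda n \delta}.
$$

This upper bound holds for all $\lambda \ge 0$, and setting $\lambda = \delta/\sigma^2$ yields the claim.

Now suppose that the matrix $V := \sum_{i=1}^n V_i$ is not full-rank, say of rank $r < d$. In this
case, an eigendecomposition yields $V = U D U^{T}$, where $U \in \mathbb{R}^{d \times r}$ has orthonormal columns.
Introducing the shorthand $Q := \sum_{i=1}^n Q_i$, the $r$-dimensional matrix $\tilde{Q} = U^{T} Q U$ then captures
all randomness in $Q$, and in particular we have $|||\tilde{Q}|||_2 = |||Q|||_2$. We can thus apply the same
argument to bound $|||\tilde{Q}|||_2$, leading to a pre-factor of $r$ instead of $d$.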
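The choice $\lambda = \delta/\sigma^2$ in the proof comes from minimizing the exponent $\frac{\lambda^2}{2} n \sigma^2 - \lambda n \delta$ over $\lambda \ge 0$. As a short worked derivation (the function name $g$ is introduced here only for illustration):

```latex
\begin{align*}
g(\lambda) &:= \frac{\lambda^2}{2}\, n \sigma^2 - \lambda n \delta,
  & g'(\lambda) &= \lambda n \sigma^2 - n \delta = 0
  \;\Longrightarrow\; \lambda^\ast = \frac{\delta}{\sigma^2}, \\
g(\lambda^\ast) &= \frac{n \delta^2}{2 \sigma^2} - \frac{n \delta^2}{\sigma^2}
  = -\frac{n \delta^2}{2 \sigma^2},
  & 2d\, e^{g(\lambda^\ast)} &= 2d\, e^{-\frac{n \delta^2}{2 \sigma^2}}.
\end{align*}
```

Since $g$ is a convex quadratic in $\lambda$, the stationary point is the global minimizer, and it lies in the admissible range $\lambda \ge 0$ whenever $\delta \ge 0$.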


An important fact is that inequality (6.38) also implies an analogous bound for general
independent but potentially non-symmetric and/or non-square matrices, with $d$ replaced
by $(d_1 + d_2)$. More specifically, a problem involving general zero-mean random matrices
$A_i \in \mathbb{R}^{d_1 \times d_2}$ can be transformed to a symmetric version by defining the $(d_1 + d_2)$-dimensional
square matrices

$$
Q_i := \begin{bmatrix} 0_{d_1 \times d_1} & A_i \\ A_i^{T} & 0_{d_2 \times d_2} \end{bmatrix}, \tag{6.39}
$$

and imposing some form of moment generating function bound (for instance, the
sub-Gaussian condition (6.27)) on the symmetric matrices $Q_i$. See Exercise 6.10 for further
details.


A significant feature of the tail bound (6.38) is the appearance of either the rank or the 
dimension d in front of the exponent. In certain cases, this dimension-dependent factor is 
superfluous, and leads to sub-optimal bounds. However, it cannot be avoided in general. The 
following example illustrates these two extremes.

Example 6.16 (Looseness/sharpness of Theorem 6.15) For simplicity, let us consider
examples with $n = d$. For each $i = 1, 2, \ldots, d$, let $E_i \in \mathcal{S}^{d \times d}$ denote the diagonal matrix with
1 in position $(i, i)$, and 0s elsewhere. Define $Q_i = y_i E_i$, where $\{y_i\}_{i=1}^n$ is an i.i.d. sequence
of 1-sub-Gaussian variables. Two specific cases to keep in mind are Rademacher variables
$\{\varepsilon_i\}_{i=1}^n$, and $N(0, 1)$ variables $\{g_i\}_{i=1}^n$.

For any such choice of sub-Gaussian variables, a calculation similar to that of
Example 6.7 shows that each $Q_i$ satisfies the sub-Gaussian bound (6.27) with $V_i = E_i$, and
hence $\sigma^2 = ||| \frac{1}{d} \sum_{i=1}^d V_i |||_2 = 1/d$. Consequently, an application of Theorem 6.15 yields the
tail bound

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{d} \sum_{i=1}^d Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2d\, e^{-\frac{d^2 \delta^2}{2}} \quad \text{for all } \delta \ge 0, \tag{6.40}
$$

which implies that $||| \frac{1}{d} \sum_{i=1}^d Q_i |||_2 \lesssim \frac{\sqrt{2 \log(2d)}}{d}$ with high probability. On the other hand, an
explicit calculation shows that

$$
\Big|\Big|\Big| \frac{1}{d} \sum_{i=1}^d Q_i \Big|\Big|\Big|_2 = \max_{i = 1, \ldots, d} \frac{|y_i|}{d}. \tag{6.41}
$$

Comparing the exact result (6.41) with the bound (6.40) yields a range of behavior. At one
extreme, for i.i.d. Rademacher variables $y_i = \varepsilon_i \in \{-1, +1\}$, we have $||| \frac{1}{d} \sum_{i=1}^d Q_i |||_2 = 1/d$,
showing that the bound (6.40) is off by the order $\sqrt{\log d}$. On the other hand, for i.i.d.
Gaussian variables $y_i = g_i \sim N(0, 1)$, we have

$$
\Big|\Big|\Big| \frac{1}{d} \sum_{i=1}^d Q_i \Big|\Big|\Big|_2 = \max_{i = 1, \ldots, d} \frac{|g_i|}{d} \asymp \frac{\sqrt{2 \log d}}{d},
$$

using the fact that the maximum of $d$ i.i.d. $N(0, 1)$ variables scales as $\sqrt{2 \log d}$.
Consequently, Theorem 6.15 cannot be improved for this class of random matrices. ♣
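The two extremes in Example 6.16 are easy to see in simulation. The sketch below, assuming NumPy, draws one sample of each ensemble and compares the exact operator norm from (6.41) with the scale $\sqrt{2 \log(2d)}/d$ suggested by (6.40); the dimension `d` and the random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                                   # illustrative dimension, with n = d

# (1/d) * sum_i y_i E_i is diagonal with entries y_i / d,
# so its operator norm equals max_i |y_i| / d, as in (6.41).
rademacher = rng.choice([-1.0, 1.0], size=d)
gaussian = rng.standard_normal(d)
norm_rademacher = np.abs(rademacher).max() / d
norm_gaussian = np.abs(gaussian).max() / d

# Value of delta at which the right-hand side of (6.40) equals one.
theorem_scale = np.sqrt(2 * np.log(2 * d)) / d

# Report all three quantities in units of 1/d.
print(norm_rademacher * d, norm_gaussian * d, theorem_scale * d)
```

In units of $1/d$, the Rademacher norm is exactly one, whereas the Gaussian norm and the scale from Theorem 6.15 both grow like $\sqrt{\log d}$.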


Bernstein-type bounds for random matrices 


We now turn to bounds on random matrices that satisfy sub-exponential tail conditions, in 
particular of the Bernstein form (6.29). 


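Theorem 6.17 (Bernstein bound for random matrices) Let $\{Q_i\}_{i=1}^n$ be a sequence of
independent, zero-mean, symmetric random matrices that satisfy the Bernstein
condition (6.29) with parameter $b > 0$. Then for all $\delta \ge 0$, the operator norm satisfies the
tail bound

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2\, \mathrm{rank}\Big( \sum_{i=1}^n \mathrm{var}(Q_i) \Big) \exp\Big\{ -\frac{n \delta^2}{2(\sigma^2 + b \delta)} \Big\}, \tag{6.42}
$$

where $\sigma^2 := \frac{1}{n} ||| \sum_{j=1}^n \mathrm{var}(Q_j) |||_2$.


Proof By Lemma 6.13, we have $\mathrm{tr}(\Psi_{S_n}(\lambda)) \le \mathrm{tr}\big( e^{\sum_{i=1}^n \log \Psi_{Q_i}(\lambda)} \big)$. By Lemma 6.11, the
Bernstein condition combined with matrix monotonicity of the logarithm yields the bound
$\log \Psi_{Q_i}(\lambda) \preceq \frac{\lambda^2\, \mathrm{var}(Q_i)}{2(1 - b|\lambda|)}$ for any $|\lambda| < \frac{1}{b}$. Putting together the pieces yields

$$
\mathrm{tr}\big( e^{\sum_{i=1}^n \log \Psi_{Q_i}(\lambda)} \big) \le \mathrm{tr}\Big( \exp\Big( \frac{\lambda^2 \sum_{i=1}^n \mathrm{var}(Q_i)}{2(1 - b|\lambda|)} \Big) \Big) \le \mathrm{rank}\Big( \sum_{i=1}^n \mathrm{var}(Q_i) \Big)\, e^{\frac{n \lambda^2 \sigma^2}{2(1 - b|\lambda|)}},
$$

where the final inequality follows from the same argument as the proof of Theorem 6.15.
Combined with the upper bound (6.37), we find that

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n Q_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2\, \mathrm{rank}\Big( \sum_{i=1}^n \mathrm{var}(Q_i) \Big)\, e^{\frac{n \sigma^2 \lambda^2}{2(1 - b\lambda)} - \lambda n \delta},
$$

valid for all $\lambda \in [0, 1/b)$. Setting $\lambda = \frac{\delta}{\sigma^2 + b\delta} \in (0, \frac{1}{b})$ and simplifying yields the claim (6.42).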
Remarks: Note that the tail bound (6.42) is of the sub-exponential type, with two regimes
of behavior depending on the relative sizes of the parameters $\sigma^2$ and $b$. Thus, it is a
natural generalization of the classical Bernstein bound for scalar random variables. As with
Theorem 6.15, Theorem 6.17 can also be generalized to non-symmetric (and potentially
non-square) matrices $\{A_i\}_{i=1}^n$ by introducing the sequence $\{Q_i\}_{i=1}^n$ of symmetric matrices
defined in equation (6.39), and imposing the Bernstein condition on it. As one special case, if
$|||A_i|||_2 \le b$ almost surely, then it can be verified that the matrices $\{Q_i\}_{i=1}^n$ satisfy the Bernstein
condition with $b$ and the quantity

$$
\sigma^2 := \max\Big\{ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \mathbb{E}[A_i A_i^{T}] \Big|\Big|\Big|_2,\; \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \mathbb{E}[A_i^{T} A_i] \Big|\Big|\Big|_2 \Big\}. \tag{6.43}
$$

We provide an instance of this type of transformation in Example 6.18 to follow.
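The two regimes of the bound (6.42) can be made concrete by comparing its exponent with a Gaussian-type proxy $n\delta^2/(2\sigma^2)$ and an exponential-type proxy $n\delta/(2b)$. The following is a small sketch in plain Python; the function names and all numerical values are illustrative choices.

```python
import math

def bernstein_exponent(n, delta, sigma2, b):
    """Exponent n * delta^2 / (2 * (sigma2 + b * delta)) appearing in (6.42)."""
    return n * delta**2 / (2 * (sigma2 + b * delta))

def matrix_bernstein_bound(n, delta, sigma2, b, rank):
    """Right-hand side of (6.42) for a given rank pre-factor."""
    return 2 * rank * math.exp(-bernstein_exponent(n, delta, sigma2, b))

n, sigma2, b = 100, 1.0, 2.0               # illustrative values only
small_delta, large_delta = 0.01, 100.0     # b * delta << sigma2 and b * delta >> sigma2

# Ratio to the Gaussian-type exponent n * delta^2 / (2 * sigma2): close to one for small delta.
ratio_small = bernstein_exponent(n, small_delta, sigma2, b) / (n * small_delta**2 / (2 * sigma2))
# Ratio to the exponential-type exponent n * delta / (2 * b): close to one for large delta.
ratio_large = bernstein_exponent(n, large_delta, sigma2, b) / (n * large_delta / (2 * b))

example_bound = matrix_bernstein_bound(n, 0.5, sigma2, b, rank=10)
print(ratio_small, ratio_large, example_bound)
```

The first ratio equals $\sigma^2/(\sigma^2 + b\delta)$ and the second equals $b\delta/(\sigma^2 + b\delta)$, which makes the transition between the two regimes at $\delta \approx \sigma^2/b$ explicit.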


The problem of matrix completion provides an interesting class of examples in which 
Theorem 6.17 can be fruitfully applied. See Chapter 10 for a detailed description of the 
underlying problem, which motivates the following discussion. 


Example 6.18 (Tail bounds in matrix completion) Consider an i.i.d. sequence of matrices
of the form $A_i = \xi_i X_i \in \mathbb{R}^{d \times d}$, where $\xi_i$ is a zero-mean sub-exponential variable that satisfies
the Bernstein condition with parameter $b$ and variance $\nu^2$, and $X_i$ is a random "mask matrix",
independent from $\xi_i$, with a single entry equal to $d$ in a position chosen uniformly at random
from all $d^2$ entries, and all remaining entries equal to zero. By construction, for any fixed
matrix $\Theta \in \mathbb{R}^{d \times d}$, we have $\mathbb{E}[\langle\!\langle A_i, \Theta \rangle\!\rangle^2] = \nu^2 |||\Theta|||_F^2$, a property that plays an important role
in our later analysis of matrix completion.

As noted in Example 6.14, apart from constant factors, there is no loss of generality in
assuming that the random matrices $A_i$ have distributions that are symmetric around zero;
in this particular case, this symmetry condition is equivalent to requiring that the scalar random
variables $\xi_i$ and $-\xi_i$ follow the same distribution. Moreover, as defined, the matrices $A_i$ are
not symmetric (meaning that $A_i \ne A_i^{T}$), but as discussed following Theorem 6.17, we can
bound the operator norm $||| \frac{1}{n} \sum_{i=1}^n A_i |||_2$ in terms of the operator norm $||| \frac{1}{n} \sum_{i=1}^n Q_i |||_2$, where the
symmetrized version $Q_i \in \mathbb{R}^{2d \times 2d}$ was defined in equation (6.39).

By the independence between $\xi_i$ and $X_i$ and the symmetric distribution of $\xi_i$, we have
$\mathbb{E}[Q_i^{2m+1}] = 0$ for all $m = 0, 1, 2, \ldots$. Turning to the even moments, suppose that entry $(a, b)$
is the only non-zero in the mask matrix $X_i$. We then have

$$
Q_i^{2m} = (\xi_i)^{2m} d^{2m} \begin{bmatrix} D_a & 0 \\ 0 & D_b \end{bmatrix} \quad \text{for all } m = 1, 2, \ldots, \tag{6.44}
$$

where $D_a \in \mathbb{R}^{d \times d}$ is the diagonal matrix with a single 1 in entry $(a, a)$, with $D_b$ defined
analogously. By the Bernstein condition, we have $\mathbb{E}[\xi_i^{2m}] \le \frac{1}{2} (2m)!\, b^{2m-2} \nu^2$ for all $m = 1, 2, \ldots$.
On the other hand, $\mathbb{E}[D_a] = \frac{1}{d} I_d$ since the probability of choosing $a$ in the first coordinate
is $1/d$. We thus see that $\mathrm{var}(Q_i) = \nu^2 d\, I_{2d}$. Putting together the pieces, we have shown that

$$
\mathbb{E}[Q_i^{2m}] \preceq \frac{1}{2} (2m)!\, b^{2m-2} \nu^2 d^{2m}\, \frac{1}{d} I_{2d} = \frac{1}{2} (2m)!\, (bd)^{2m-2}\, \mathrm{var}(Q_i),
$$

showing that $Q_i$ satisfies the Bernstein condition with parameters $bd$ and

$$
\sigma^2 := \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \mathrm{var}(Q_i) \Big|\Big|\Big|_2 \le \nu^2 d.
$$

Consequently, Theorem 6.17 implies that

$$
\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n A_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 4d\, e^{-\frac{n \delta^2}{2d(\nu^2 + b\delta)}}. \tag{6.45}
$$

♣
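The moment identity (6.44) and the variance computation in Example 6.18 can be confirmed on a small instance. The sketch below assumes NumPy and conditions on a fixed value `xi` of the noise variable; the helper `symmetrize`, the dimension and the mask position are illustrative.

```python
import numpy as np

def symmetrize(A):
    """Symmetric dilation of a d x d matrix A, as in equation (6.39)."""
    d = A.shape[0]
    return np.block([[np.zeros((d, d)), A], [A.T, np.zeros((d, d))]])

d, a, b_idx, xi = 4, 1, 3, 0.7              # illustrative dimension, mask position (a, b), noise value

X = np.zeros((d, d))
X[a, b_idx] = d                             # mask matrix with a single entry equal to d
Q = symmetrize(xi * X)

D_a = np.zeros((d, d)); D_a[a, a] = 1.0
D_b = np.zeros((d, d)); D_b[b_idx, b_idx] = 1.0
block_diag = np.block([[D_a, np.zeros((d, d))], [np.zeros((d, d)), D_b]])

# Identity (6.44) with m = 2: Q^{2m} = xi^{2m} d^{2m} blockdiag(D_a, D_b).
m = 2
lhs = np.linalg.matrix_power(Q, 2 * m)
rhs = (xi ** (2 * m)) * (d ** (2 * m)) * block_diag

# Averaging Q^2 over all d^2 equally likely mask positions gives xi^2 * d * I_{2d}.
avg_Q2 = np.zeros((2 * d, 2 * d))
for i in range(d):
    for j in range(d):
        Xij = np.zeros((d, d)); Xij[i, j] = d
        Qij = symmetrize(xi * Xij)
        avg_Q2 += Qij @ Qij / d**2
print(np.allclose(lhs, rhs), np.allclose(avg_Q2, xi**2 * d * np.eye(2 * d)))
```

Taking the expectation over the noise variable then replaces `xi**2` by the variance $\nu^2$, which recovers $\mathrm{var}(Q_i) = \nu^2 d\, I_{2d}$.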


In certain cases, it is possible to sharpen the dimension dependence of Theorem 6.17—in 
particular, by replacing the rank-based pre-factor, which can be as large as d, by a quantity 
that is potentially much smaller. We illustrate one instance of such a sharpened result in the 
following example. 


Example 6.19 (Bernstein bounds with sharpened dimension dependence) Consider a
sequence of independent zero-mean random matrices $Q_i$ bounded as $|||Q_i|||_2 \le 1$ almost surely,
and suppose that our goal is to upper bound the maximum eigenvalue $\gamma_{\max}(S_n)$ of the sum
$S_n := \sum_{i=1}^n Q_i$. Defining the function $\phi(\lambda) := e^{\lambda} - \lambda - 1$, we note that it is monotonically
increasing on the positive real line. Consequently, as verified in Exercise 6.12, for any
$\delta > 0$, we have

$$
\mathbb{P}[\gamma_{\max}(S_n) \ge \delta] \le \inf_{\lambda > 0} \frac{\mathrm{tr}(\mathbb{E}[\phi(\lambda S_n)])}{\phi(\lambda \delta)}. \tag{6.46}
$$

Moreover, using the fact that $|||Q_i|||_2 \le 1$, the same exercise shows that

$$
\log \Psi_{Q_i}(\lambda) \preceq \phi(\lambda)\, \mathrm{var}(Q_i) \tag{6.47a}
$$

and

$$
\mathrm{tr}(\mathbb{E}[\phi(\lambda S_n)]) \le \frac{\mathrm{tr}(\bar{V})}{|||\bar{V}|||_2}\, e^{\phi(\lambda) |||\bar{V}|||_2}, \tag{6.47b}
$$

where $\bar{V} := \sum_{i=1}^n \mathrm{var}(Q_i)$. Combined with the initial bound (6.46), we conclude that

$$
\mathbb{P}[\gamma_{\max}(S_n) \ge \delta] \le \frac{\mathrm{tr}(\bar{V})}{|||\bar{V}|||_2} \inf_{\lambda > 0} \Big\{ \frac{e^{\phi(\lambda) |||\bar{V}|||_2}}{\phi(\lambda \delta)} \Big\}. \tag{6.48}
$$

The significance of this bound is the appearance of the trace ratio $\frac{\mathrm{tr}(\bar{V})}{|||\bar{V}|||_2}$ as a pre-factor, as
opposed to the quantity $\mathrm{rank}(\bar{V}) \le d$ that arose in our previous derivation. Note that we
always have $\frac{\mathrm{tr}(\bar{V})}{|||\bar{V}|||_2} \le \mathrm{rank}(\bar{V})$, and in certain cases, the trace ratio can be substantially smaller
than the rank. See Exercise 6.13 for one such case. ♣


6.4.5 Consequences for covariance matrices 


We conclude with a useful corollary of Theorem 6.17 for the estimation of covariance 
matrices. 


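Corollary 6.20 Let $x_1, \ldots, x_n$ be i.i.d. zero-mean random vectors with covariance $\Sigma$
such that $\|x_j\|_2 \le \sqrt{b}$ almost surely. Then for all $\delta > 0$, the sample covariance matrix
$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^{T}$ satisfies

$$
\mathbb{P}[|||\hat{\Sigma} - \Sigma|||_2 \ge \delta] \le 2d \exp\Big( -\frac{n \delta^2}{2b(|||\Sigma|||_2 + \delta)} \Big). \tag{6.49}
$$


Proof We apply Theorem 6.17 to the zero-mean random matrices $Q_i := x_i x_i^{T} - \Sigma$. These
matrices have controlled operator norm: indeed, by the triangle inequality, we have

$$
|||Q_i|||_2 \le \|x_i\|_2^2 + |||\Sigma|||_2 \le b + |||\Sigma|||_2.
$$

Since $\Sigma = \mathbb{E}[x_i x_i^{T}]$, we have $|||\Sigma|||_2 = \max_{v \in \mathbb{S}^{d-1}} \mathbb{E}[\langle v, x_i \rangle^2] \le b$, and hence $|||Q_i|||_2 \le 2b$.
Turning to the variance of $Q_i$, we have

$$
\mathrm{var}(Q_i) = \mathbb{E}[(x_i x_i^{T})^2] - \Sigma^2 \preceq \mathbb{E}[\|x_i\|_2^2\, x_i x_i^{T}] \preceq b \Sigma,
$$

so that $|||\mathrm{var}(Q_i)|||_2 \le b\, |||\Sigma|||_2$. Substituting into the tail bound (6.42) yields the claim.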


Let us illustrate some consequences of this corollary with some examples. 


Example 6.21 (Random vectors uniform on a sphere) Suppose that the random vectors
$x_i$ are chosen uniformly from the sphere $\mathbb{S}^{d-1}(\sqrt{d})$, so that $\|x_i\|_2 = \sqrt{d}$ for all $i = 1, \ldots, n$.
By construction, we have $\mathbb{E}[x_i x_i^{T}] = \Sigma = I_d$, and hence $|||\Sigma|||_2 = 1$. Applying Corollary 6.20
yields

$$
\mathbb{P}[|||\hat{\Sigma} - I_d|||_2 \ge \delta] \le 2d\, e^{-\frac{n \delta^2}{2d + 2d\delta}} \quad \text{for all } \delta \ge 0. \tag{6.50}
$$

This bound implies that

$$
|||\hat{\Sigma} - I_d|||_2 \lesssim \sqrt{\frac{d \log d}{n}} + \frac{d \log d}{n} \tag{6.51}
$$

with high probability, so that the sample covariance is a consistent estimate as long as
$\frac{d \log d}{n} \to 0$. This result is close to optimal, with only the extra logarithmic factor being
superfluous in this particular case. It can be removed, for instance, by noting that $x_i$ is a
sub-Gaussian random vector, and then applying Theorem 6.5. ♣


Example 6.22 ("Spiked" random vectors) We now consider an ensemble of random
vectors that are rather different than the previous example, but still satisfy the same bound. In
particular, consider a random vector of the form $x_i = \sqrt{d}\, e_{a(i)}$, where $a(i)$ is an index chosen
uniformly at random from $\{1, \ldots, d\}$, and $e_{a(i)} \in \mathbb{R}^d$ is the canonical basis vector with 1 in
position $a(i)$. As before, we have $\|x_i\|_2 = \sqrt{d}$, and $\mathbb{E}[x_i x_i^{T}] = I_d$, so that the tail bound (6.50)
also applies to this ensemble. An interesting fact is that, for this particular ensemble, the
bound (6.51) is sharp, meaning it cannot be improved beyond constant factors. ♣
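For the spiked ensemble of Example 6.22 the sample covariance is diagonal, so its error can be computed directly from index counts. The following sketch assumes NumPy; the sizes `d` and `n` and the seed are illustrative, and the comparison with the rate in (6.51) is only indicative since that rate hides constant factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 5000                                # illustrative dimension and sample size

idx = rng.integers(0, d, size=n)               # indices a(i), uniform on {0, ..., d-1}
counts = np.bincount(idx, minlength=d)

# x_i = sqrt(d) * e_{a(i)}, so x_i x_i^T = d * e_{a(i)} e_{a(i)}^T and the
# sample covariance is diagonal with entries d * counts_j / n.
sigma_hat = np.diag(d * counts / n)
error = np.linalg.norm(sigma_hat - np.eye(d), ord=2)

# Scale appearing on the right-hand side of (6.51), without constants.
rate = np.sqrt(d * np.log(d) / n) + d * np.log(d) / n
print(error, rate)
```

Because each diagonal entry is a rescaled binomial count, the operator-norm error is the largest of $d$ such deviations, which is where the logarithmic factor originates.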


6.5 Bounds for structured covariance matrices 


In the preceding sections, our primary focus has been estimation of general unstructured 
covariance matrices via the sample covariance. When a covariance matrix is equipped with 
additional structure, faster rates of estimation are possible using different estimators than the 
sample covariance matrix. In this section, we explore the faster rates that are achievable for 
sparse and/or graph-structured matrices. 

In the simplest setting, the covariance matrix is known to be sparse, and the positions of
the non-zero entries are known. In such settings, it is natural to consider matrix estimators
that are non-zero only in these known positions. For instance, if we are given a priori
knowledge that the covariance matrix is diagonal, then it would be natural to use the estimate
$\hat{D} := \mathrm{diag}\{\hat{\Sigma}_{11}, \hat{\Sigma}_{22}, \ldots, \hat{\Sigma}_{dd}\}$, corresponding to the diagonal entries of the sample covariance
matrix $\hat{\Sigma}$. As we explore in Exercise 6.15, the performance of this estimator can be
substantially better: in particular, for sub-Gaussian variables, it achieves an estimation error of the
order $\sqrt{\frac{\log d}{n}}$, as opposed to the order $\sqrt{\frac{d}{n}}$ rates in the unstructured setting. Similar statements
apply to other forms of known sparsity.


6.5.1 Unknown sparsity and thresholding 


More generally, suppose that the covariance matrix $\Sigma$ is known to be relatively sparse, but
that the positions of the non-zero entries are no longer known. It is then natural to consider
estimators based on thresholding. Given a parameter $\lambda > 0$, the hard-thresholding operator
is given by

$$
T_\lambda(u) := u\, \mathbb{I}[|u| > \lambda] = \begin{cases} u & \text{if } |u| > \lambda, \\ 0 & \text{otherwise.} \end{cases} \tag{6.52}
$$

With a minor abuse of notation, for a matrix $M$, we write $T_\lambda(M)$ for the matrix obtained
by applying the thresholding operator to each element of $M$. In this section, we study the
performance of the estimator $T_{\lambda_n}(\hat{\Sigma})$, where the parameter $\lambda_n > 0$ is suitably chosen as a
function of the sample size $n$ and matrix dimension $d$.
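The entrywise hard-thresholding operator (6.52) is a one-line array operation. A minimal sketch, assuming NumPy, with a hypothetical function name `hard_threshold` and an illustrative input matrix:

```python
import numpy as np

def hard_threshold(M, lam):
    """Apply T_lam entrywise: keep entries with |entry| > lam, set the rest to zero."""
    M = np.asarray(M, dtype=float)
    return np.where(np.abs(M) > lam, M, 0.0)

# Illustrative symmetric matrix with a few small off-diagonal entries.
M = np.array([[1.0, 0.2, -0.05],
              [0.2, 1.5, 0.3],
              [-0.05, 0.3, 0.9]])
T = hard_threshold(M, 0.25)
print(T)
```

Note that the inequality in (6.52) is strict, so an entry whose magnitude equals the threshold exactly is set to zero.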

The sparsity of the covariance matrix can be measured in various ways. Its zero pattern
is captured by the adjacency matrix $A \in \mathbb{R}^{d \times d}$ with entries $A_{j\ell} = \mathbb{I}[\Sigma_{j\ell} \ne 0]$. This adjacency
matrix defines the edge structure of an undirected graph $G$ on the vertices $\{1, 2, \ldots, d\}$, with
edge $(j, \ell)$ included in the graph if and only if $\Sigma_{j\ell} \ne 0$, along with the self-edges $(j, j)$ for
each of the diagonal entries. The operator norm $|||A|||_2$ of the adjacency matrix provides a
natural measure of sparsity. In particular, it can be verified that $|||A|||_2 \le d$, with equality
holding when $G$ is fully connected, meaning that $\Sigma$ has no zero entries. More generally, as
shown in Exercise 6.2, we have $|||A|||_2 \le s$ whenever $\Sigma$ has at most $s$ non-zero entries per
row, or equivalently when the graph $G$ has maximum degree at most $s - 1$. The following
result provides a guarantee for the thresholded sample covariance matrix that involves the
graph adjacency matrix $A$ defined by $\Sigma$.


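Theorem 6.23 (Thresholding-based covariance estimation) Let $\{x_i\}_{i=1}^n$ be an i.i.d.
sequence of zero-mean random vectors with covariance matrix $\Sigma$, and suppose that each
component $x_{ij}$ is sub-Gaussian with parameter at most $\sigma$. If $n > \log d$, then for any
$\delta > 0$, the thresholded sample covariance matrix $T_{\lambda_n}(\hat{\Sigma})$ with $\lambda_n/\sigma^2 = 8 \sqrt{\frac{\log d}{n}} + \delta$
satisfies

$$
\mathbb{P}\big[ |||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \ge 2\, |||A|||_2\, \lambda_n \big] \le 8\, e^{-\frac{n}{16} \min\{\delta, \delta^2\}}. \tag{6.53}
$$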

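Underlying the proof of Theorem 6.23 is the following (deterministic) result: for any
choice of $\lambda_n$ such that $\|\hat{\Sigma} - \Sigma\|_{\max} \le \lambda_n$, we are guaranteed that

$$
|||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \le 2\, |||A|||_2\, \lambda_n. \tag{6.54}
$$

The proof of this intermediate claim is straightforward. First, for any index pair $(j, \ell)$ such
that $\Sigma_{j\ell} = 0$, the bound $\|\hat{\Sigma} - \Sigma\|_{\max} \le \lambda_n$ guarantees that $|\hat{\Sigma}_{j\ell}| \le \lambda_n$, and hence that $T_{\lambda_n}(\hat{\Sigma}_{j\ell}) = 0$
by definition of the thresholding operator. On the other hand, for any pair $(j, \ell)$ for which
$\Sigma_{j\ell} \ne 0$, we have

$$
|T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}| \overset{(i)}{\le} |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \hat{\Sigma}_{j\ell}| + |\hat{\Sigma}_{j\ell} - \Sigma_{j\ell}| \overset{(ii)}{\le} 2 \lambda_n,
$$

where step (i) follows from the triangle inequality, and step (ii) follows from the fact that
$|T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \hat{\Sigma}_{j\ell}| \le \lambda_n$, and a second application of the assumption $\|\hat{\Sigma} - \Sigma\|_{\max} \le \lambda_n$.
Consequently, we have shown that the matrix $B := |T_{\lambda_n}(\hat{\Sigma}) - \Sigma|$ satisfies the elementwise
inequality $B \le 2 \lambda_n A$. Since both $B$ and $A$ have non-negative entries, we are guaranteed that
$|||B|||_2 \le 2 \lambda_n |||A|||_2$, and hence that $|||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \le 2 \lambda_n |||A|||_2$ as claimed. (See Exercise 6.3
for the details of these last steps.)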
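The deterministic bound (6.54) holds for any matrix within max-norm distance $\lambda_n$ of $\Sigma$, so it can be checked without sampling. The sketch below assumes NumPy and uses an illustrative tridiagonal covariance together with a small symmetric perturbation that stands in for sampling error; the name `hard_threshold` is hypothetical.

```python
import numpy as np

def hard_threshold(M, lam):
    """Entrywise hard-thresholding operator T_lam from (6.52)."""
    return np.where(np.abs(M) > lam, M, 0.0)

rng = np.random.default_rng(0)
d = 8
# Illustrative tridiagonal (hence sparse) covariance matrix.
Sigma = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
A = (Sigma != 0).astype(float)                 # adjacency matrix, self-edges included

# Symmetric perturbation standing in for the sampling error of the sample covariance.
E = rng.uniform(-0.05, 0.05, size=(d, d))
Sigma_hat = Sigma + (E + E.T) / 2

lam = np.abs(Sigma_hat - Sigma).max()          # smallest lambda_n allowed by the assumption
lhs = np.linalg.norm(hard_threshold(Sigma_hat, lam) - Sigma, ord=2)
rhs = 2 * lam * np.linalg.norm(A, ord=2)
print(lhs, rhs)
```

In this example the thresholded estimate recovers the zero pattern of $\Sigma$ exactly, so the error is supported on the non-zero positions recorded by $A$.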


Theorem 6.23 has a number of interesting corollaries for particular classes of covariance
matrices.

Corollary 6.24 Suppose that, in addition to the conditions of Theorem 6.23, the
covariance matrix $\Sigma$ has at most $s$ non-zero entries per row. Then with $\lambda_n/\sigma^2 = 8 \sqrt{\frac{\log d}{n}} + \delta$
for some $\delta > 0$, we have

$$
\mathbb{P}\big[ |||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \ge 2 s \lambda_n \big] \le 8\, e^{-\frac{n}{16} \min\{\delta, \delta^2\}}. \tag{6.55}
$$


In order to establish these claims from Theorem 6.23, it suffices to show that $|||A|||_2 \le s$.
Since $A$ has at most $s$ ones per row (with the remaining entries equal to zero), this claim
follows from the result of Exercise 6.2.
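The inequality $|||A|||_2 \le s$ can be tight or loose depending on the graph. The sketch below, assuming NumPy, computes the operator norm and the maximum number of ones per row for two illustrative graphs on $d = 9$ vertices: one containing an $s$-clique, and a star-shaped graph in which one hub is joined to $s - 1$ other vertices. The function name `adjacency_norm` is hypothetical.

```python
import numpy as np

def adjacency_norm(edges, d):
    """Operator norm and maximum row sum of the adjacency matrix on d vertices
    with the given undirected edges plus all self-edges."""
    A = np.eye(d)
    for j, l in edges:
        A[j, l] = A[l, j] = 1.0
    return np.linalg.norm(A, ord=2), int(A.sum(axis=1).max())

d, s = 9, 5
clique_edges = [(j, l) for j in range(s) for l in range(j + 1, s)]   # s-clique on vertices 0..s-1
star_edges = [(0, l) for l in range(1, s)]                           # hub 0 joined to s-1 spokes

clique_norm, clique_row_max = adjacency_norm(clique_edges, d)
star_norm, star_row_max = adjacency_norm(star_edges, d)
print(clique_norm, clique_row_max, star_norm, star_row_max)
```

Both graphs have at most $s = 5$ ones per row, but the operator norm equals $s$ only for the clique; for the star it equals $1 + \sqrt{s - 1}$.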


Example 6.25 (Sparsity and adjacency matrices) In certain ways, the bound (6.55) is more
appealing than the bound (6.53), since it is based on a local quantity, namely the maximum
degree of the graph defined by the covariance matrix, as opposed to the spectral norm
$|||A|||_2$. In certain cases, these two bounds coincide. As an example, consider any graph with
maximum degree $s - 1$ that contains an $s$-clique (i.e., a subset of $s$ nodes that are all joined
by edges). As we explore in Exercise 6.16, for any such graph, we have $|||A|||_2 = s$, so that
the two bounds are equivalent.


Figure 6.1 (a) An instance of a graph on $d = 9$ nodes containing an $s = 5$ clique.
For this class of graphs, the bounds (6.53) and (6.55) coincide. (b) A hub-and-spoke
graph on $d = 9$ nodes with maximum degree $s = 5$. For this class of graphs, the
bounds differ by a factor of $\sqrt{s}$.


However, in general, the bound (6.53) can be substantially sharper than the bound (6.55).
As an example, consider a hub-and-spoke graph, in which one central node known as the
hub is connected to $s$ of the remaining $d - 1$ nodes, as illustrated in Figure 6.1(b). For such
a graph, we have $|||A|||_2 = 1 + \sqrt{s - 1}$, so that in this case Theorem 6.23 guarantees that

$$
|||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \lesssim \sqrt{\frac{s \log d}{n}},
$$

with high probability, a bound that is sharper by a factor of order $\sqrt{s}$ compared to the
bound (6.55) from Corollary 6.24. ♣

We now turn to the proof of the remainder of Theorem 6.23. Based on the reasoning
leading to equation (6.54), it suffices to establish a high-probability bound on the elementwise
infinity norm of the error matrix $\hat{\Delta} := \hat{\Sigma} - \Sigma$.


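Lemma 6.26 Under the conditions of Theorem 6.23, we have

$$
\mathbb{P}[\|\hat{\Delta}\|_{\max}/\sigma^2 \ge t] \le 8\, e^{-\frac{n}{16} \min\{t, t^2\} + 2 \log d} \quad \text{for all } t > 0. \tag{6.56}
$$


Setting $t = \lambda_n/\sigma^2 = 8 \sqrt{\frac{\log d}{n}} + \delta$ in the bound (6.56) yields

$$
\mathbb{P}[\|\hat{\Delta}\|_{\max} \ge \lambda_n] \le 8\, e^{-\frac{n}{16} \min\{\delta, \delta^2\}},
$$

where we have used the fact that $n > \log d$ by assumption.

It remains to prove Lemma 6.26. Note that the rescaled vector $x_i/\sigma$ is sub-Gaussian with
parameter at most 1. Consequently, we may assume without loss of generality that $\sigma = 1$,
and then rescale at the end. First considering a diagonal entry, the result of Exercise 6.15(a)
guarantees that there are universal positive constants $c_1, c_2$ such that

$$
\mathbb{P}[|\hat{\Delta}_{jj}| \ge c_1 \delta] \le 2\, e^{-c_2 n \delta^2} \quad \text{for all } \delta \in (0, 1). \tag{6.57}
$$

Turning to the non-diagonal entries, for any $j \ne \ell$, we have

$$
2 \hat{\Delta}_{j\ell} = \frac{2}{n} \sum_{i=1}^n x_{ij} x_{i\ell} - 2 \Sigma_{j\ell} = \frac{1}{n} \sum_{i=1}^n (x_{ij} + x_{i\ell})^2 - (\Sigma_{jj} + \Sigma_{\ell\ell} + 2 \Sigma_{j\ell}) - \hat{\Delta}_{jj} - \hat{\Delta}_{\ell\ell}.
$$

Since $x_{ij}$ and $x_{i\ell}$ are both zero-mean and sub-Gaussian with parameter $\sigma$, the sum $x_{ij} + x_{i\ell}$
is zero-mean and sub-Gaussian with parameter at most $2\sqrt{2}\sigma$ (see Exercise 2.13(c)).
Consequently, there are universal constants $c_2, c_3$ such that for all $\delta \in (0, 1)$, we have

$$
\mathbb{P}\Big[ \Big| \frac{1}{n} \sum_{i=1}^n (x_{ij} + x_{i\ell})^2 - (\Sigma_{jj} + \Sigma_{\ell\ell} + 2 \Sigma_{j\ell}) \Big| \ge c_3 \delta \Big] \le 2\, e^{-c_2 n \delta^2},
$$

and hence, combining with our earlier diagonal bound (6.57), we obtain the tail bound
$\mathbb{P}[|\hat{\Delta}_{j\ell}| \ge c_4 \delta] \le 6\, e^{-c_2 n \delta^2}$. Finally, combining this bound with the earlier inequality (6.57)
and then taking a union bound over all $d^2$ entries of the matrix yields the stated claim (6.56).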


6.5.2 Approximate sparsity 


Given a covariance matrix $\Sigma$ with no entries that are exactly zero, the bounds of
Theorem 6.23 are very poor. In particular, for a completely dense matrix, the associated adjacency
matrix $A$ is simply the all-ones matrix, so that $|||A|||_2 = d$. Intuitively, one might expect that
these bounds could be improved if $\Sigma$ had a large number of non-zero entries, but many of
them were "near zero".

Recall that one way in which to measure the sparsity of $\Sigma$ is in terms of the maximum
number of non-zero entries per row. A generalization of this idea is to measure the $\ell_q$-"norm"
of each row. More specifically, given a parameter $q \in [0, 1]$ and a radius $R_q$, we impose the
constraint

$$
\max_{j = 1, \ldots, d} \sum_{\ell = 1}^d |\Sigma_{j\ell}|^q \le R_q. \tag{6.58}
$$

(See Figure 7.1 in Chapter 7 for an illustration of these types of sets.) In the special case
$q = 0$, this constraint is equivalent to requiring that each row of $\Sigma$ have at most $R_0$
non-zero entries. For intermediate values $q \in (0, 1]$, it allows for many non-zero entries but
requires that their absolute magnitudes (if ordered from largest to smallest) drop off
relatively quickly.


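Theorem 6.27 (Covariance estimation under $\ell_q$-sparsity) Suppose that the covariance
matrix $\Sigma$ satisfies the $\ell_q$-sparsity constraint (6.58). Then for any $\lambda_n$ such that
$\|\hat{\Sigma} - \Sigma\|_{\max} \le \lambda_n/2$, we are guaranteed that

$$
|||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \le 4 R_q \lambda_n^{1-q}. \tag{6.59a}
$$

Consequently, if the sample covariance is formed using i.i.d. samples $\{x_i\}_{i=1}^n$ that are
zero-mean with sub-Gaussian parameter at most $\sigma$, then with $\lambda_n/\sigma^2 = 8 \sqrt{\frac{\log d}{n}} + \delta$, we
have

$$
\mathbb{P}\big[ |||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \ge 4 R_q \lambda_n^{1-q} \big] \le 8\, e^{-\frac{n}{16} \min\{\delta, \delta^2\}} \quad \text{for all } \delta > 0. \tag{6.59b}
$$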


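Proof Given the deterministic claim (6.59a), the probabilistic bound (6.59b) follows from
standard tail bounds on sub-exponential variables. The deterministic claim is based on the
assumption that $\|\hat{\Sigma} - \Sigma\|_{\max} \le \lambda_n/2$. By the result of Exercise 6.2, the operator norm can be
upper bounded as

$$
|||T_{\lambda_n}(\hat{\Sigma}) - \Sigma|||_2 \le \max_{j = 1, \ldots, d} \sum_{\ell = 1}^d |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}|.
$$

Fixing an index $j \in \{1, 2, \ldots, d\}$, define the set $S_j(\lambda_n/2) = \{\ell \in \{1, \ldots, d\} \mid |\Sigma_{j\ell}| > \lambda_n/2\}$.
For any index $\ell \in S_j(\lambda_n/2)$, we have

$$
|T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}| \le |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \hat{\Sigma}_{j\ell}| + |\hat{\Sigma}_{j\ell} - \Sigma_{j\ell}| \le \frac{3}{2} \lambda_n.
$$

On the other hand, for any index $\ell \notin S_j(\lambda_n/2)$, we have $T_{\lambda_n}(\hat{\Sigma}_{j\ell}) = 0$, by definition of the
thresholding operator, and hence

$$
|T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}| = |\Sigma_{j\ell}|.
$$

Putting together the pieces, we have

$$
\begin{aligned}
\sum_{\ell = 1}^d |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}|
&= \sum_{\ell \in S_j(\lambda_n/2)} |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}| + \sum_{\ell \notin S_j(\lambda_n/2)} |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}| \\
&\le |S_j(\lambda_n/2)|\, \frac{3}{2} \lambda_n + \sum_{\ell \notin S_j(\lambda_n/2)} |\Sigma_{j\ell}|.
\end{aligned} \tag{6.60}
$$

Now we have

$$
\sum_{\ell \notin S_j(\lambda_n/2)} |\Sigma_{j\ell}| = \frac{\lambda_n}{2} \sum_{\ell \notin S_j(\lambda_n/2)} \frac{|\Sigma_{j\ell}|}{\lambda_n/2}
\overset{(i)}{\le} \frac{\lambda_n}{2} \sum_{\ell \notin S_j(\lambda_n/2)} \Big( \frac{|\Sigma_{j\ell}|}{\lambda_n/2} \Big)^q
\overset{(ii)}{\le} \lambda_n^{1-q} R_q,
$$

where step (i) follows since $|\Sigma_{j\ell}| \le \lambda_n/2$ for all $\ell \notin S_j(\lambda_n/2)$ and $q \in [0, 1]$, and step (ii)
follows by the assumption (6.58). On the other hand, we have

$$
R_q \ge \sum_{\ell = 1}^d |\Sigma_{j\ell}|^q \ge |S_j(\lambda_n/2)| \Big( \frac{\lambda_n}{2} \Big)^q,
$$

whence $|S_j(\lambda_n/2)| \le 2^q R_q \lambda_n^{-q}$. Combining these ingredients with the inequality (6.60), we
find that

$$
\sum_{\ell = 1}^d |T_{\lambda_n}(\hat{\Sigma}_{j\ell}) - \Sigma_{j\ell}| \le 2^q R_q \lambda_n^{1-q}\, \frac{3}{2} + R_q \lambda_n^{1-q} \le 4 R_q \lambda_n^{1-q}.
$$

Since this same argument holds for each index $j = 1, \ldots, d$, the claim (6.59a) follows.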
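As with the exactly sparse case, the deterministic claim (6.59a) can be checked on a synthetic instance. The sketch below assumes NumPy and uses an illustrative dense covariance with geometrically decaying entries, for which the smallest admissible radius $R_q$ in (6.58) is computed directly; the perturbation, the value of $q$ and the function name `hard_threshold` are assumptions made for the illustration.

```python
import numpy as np

def hard_threshold(M, lam):
    """Entrywise hard-thresholding operator T_lam from (6.52)."""
    return np.where(np.abs(M) > lam, M, 0.0)

rng = np.random.default_rng(0)
d, q = 30, 0.5                                     # illustrative dimension and sparsity exponent
idx = np.arange(d)
# Dense covariance with no zero entries: Sigma_{jl} = 0.5^{|j - l|}.
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
R_q = (np.abs(Sigma) ** q).sum(axis=1).max()       # smallest radius satisfying (6.58)

E = rng.uniform(-0.02, 0.02, size=(d, d))
Sigma_hat = Sigma + (E + E.T) / 2                  # symmetric stand-in for the sample covariance

lam = 2 * np.abs(Sigma_hat - Sigma).max()          # ensures max-norm error <= lam / 2
lhs = np.linalg.norm(hard_threshold(Sigma_hat, lam) - Sigma, ord=2)
rhs = 4 * R_q * lam ** (1 - q)
print(lhs, rhs)
```

For this matrix the adjacency-based bound of Theorem 6.23 would scale with $d$, whereas the radius $R_q$ stays bounded as $d$ grows because the row sums of $|\Sigma_{j\ell}|^q$ form a convergent geometric series.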


6.6 Appendix: Proof of Theorem 6.1 


It remains to prove the lower bound (6.9) on the minimal singular value. In order to do 
so, we follow an argument similar to that used to upper bound the maximal singular value. 
Throughout this proof, we assume that $\Sigma$ is strictly positive definite (and hence invertible);
otherwise, its minimal singular value is zero, and the claimed lower bound is vacuous. We
begin by lower bounding the expectation using a Gaussian comparison principle due to Gordon
(1985). By definition, the minimum singular value has the variational representation
$\sigma_{\min}(X) = \min_{v' \in S^{d-1}} \|Xv'\|_2$. Let us reformulate this representation slightly for later
theoretical convenience. Recalling the shorthand notation $\bar{\sigma}_{\min} = \sigma_{\min}(\sqrt{\Sigma})$, we define the radius
$R = 1/\bar{\sigma}_{\min}$, and then consider the set


$$V(R) := \{ z \in \mathbb{R}^d \mid \|\sqrt{\Sigma} z\|_2 = 1, \ \|z\|_2 \le R \}. \qquad (6.61)$$


We claim that it suffices to show that a lower bound of the form 


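$$\min_{z \in V(R)} \frac{\|Xz\|_2}{\sqrt{n}} \ge 1 - \delta - R \sqrt{\frac{\operatorname{tr}(\Sigma)}{n}} \qquad (6.62)$$


holds with probability at least $1 - e^{-n\delta^2/2}$. Indeed, suppose that inequality (6.62) holds. Then
for any $v' \in S^{d-1}$ we can define the rescaled vector $z := v'/\|\sqrt{\Sigma} v'\|_2$. By construction, we have


$$\|\sqrt{\Sigma} z\|_2 = 1 \quad \text{and} \quad \|z\|_2 = \frac{1}{\|\sqrt{\Sigma} v'\|_2} \le \frac{1}{\sigma_{\min}(\sqrt{\Sigma})} = R,$$


so that $z \in V(R)$. We now observe that


$$\frac{\|Xv'\|_2}{\sqrt{n}} = \|\sqrt{\Sigma} v'\|_2 \, \frac{\|Xz\|_2}{\sqrt{n}} \ge \sigma_{\min}(\sqrt{\Sigma}) \min_{z \in V(R)} \frac{\|Xz\|_2}{\sqrt{n}}.$$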


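Since this bound holds for all $v' \in S^{d-1}$, we can take the minimum on the left-hand side,
thereby obtaining


$$\min_{v' \in S^{d-1}} \frac{\|Xv'\|_2}{\sqrt{n}}
\ge \bar{\sigma}_{\min} \min_{z \in V(R)} \frac{\|Xz\|_2}{\sqrt{n}}
\overset{(i)}{\ge} \bar{\sigma}_{\min} \Big\{ 1 - R \sqrt{\frac{\operatorname{tr}(\Sigma)}{n}} - \delta \Big\}
= (1 - \delta) \bar{\sigma}_{\min} - \sqrt{\frac{\operatorname{tr}(\Sigma)}{n}},$$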


where step (i) follows from the bound (6.62). 

It remains to prove the lower bound (6.62). We begin by showing concentration of the
random variable $\min_{v \in V(R)} \|Xv\|_2/\sqrt{n}$ around its expected value. Since the matrix $X \in \mathbb{R}^{n \times d}$
has i.i.d. rows, each drawn from the $N(0, \Sigma)$ distribution, we can write $X = W\sqrt{\Sigma}$, where
the random matrix $W$ is standard Gaussian. Using the fact that $\|\sqrt{\Sigma} v\|_2 = 1$ for all $v \in V(R)$,
it follows that the function $W \mapsto \min_{v \in V(R)} \|W\sqrt{\Sigma} v\|_2/\sqrt{n}$ is Lipschitz with parameter $L = 1/\sqrt{n}$.
Applying Theorem 2.26, we conclude that


$$\min_{v \in V(R)} \frac{\|Xv\|_2}{\sqrt{n}} \ge \mathbb{E}\Big[ \min_{v \in V(R)} \frac{\|Xv\|_2}{\sqrt{n}} \Big] - \delta$$


with probability at least $1 - e^{-n\delta^2/2}$.
Consequently, the proof will be complete if we can show that


$$\mathbb{E}\Big[ \min_{v \in V(R)} \frac{\|Xv\|_2}{\sqrt{n}} \Big] \ge 1 - R \sqrt{\frac{\operatorname{tr}(\Sigma)}{n}}. \qquad (6.63)$$


In order to do so, we make use of an extension of the Sudakov–Fernique inequality, known
as Gordon’s inequality, which we now state. Let $\{Z_{u,v}\}$ and $\{Y_{u,v}\}$ be a pair of zero-mean
Gaussian processes indexed by a non-empty index set $T = U \times V$. Suppose that


$$\mathbb{E}[(Z_{u,v} - Z_{\tilde{u},\tilde{v}})^2] \le \mathbb{E}[(Y_{u,v} - Y_{\tilde{u},\tilde{v}})^2] \quad \text{for all pairs } (u, v) \text{ and } (\tilde{u}, \tilde{v}) \in T, \qquad (6.64)$$


and moreover that this inequality holds with equality whenever $v = \tilde{v}$. Under these conditions,
Gordon’s inequality guarantees that


$$\mathbb{E}\Big[ \max_{v \in V} \min_{u \in U} Z_{u,v} \Big] \le \mathbb{E}\Big[ \max_{v \in V} \min_{u \in U} Y_{u,v} \Big]. \qquad (6.65)$$

In order to exploit this result, we first observe that


$$-\min_{z \in V(R)} \|Xz\|_2 = \max_{z \in V(R)} \{ -\|Xz\|_2 \} = \max_{z \in V(R)} \min_{u \in S^{n-1}} u^{\mathrm{T}} X z.$$


As before, if we introduce the standard Gaussian random matrix $W \in \mathbb{R}^{n \times d}$, then for any
$z \in V(R)$, we can write $u^{\mathrm{T}} X z = u^{\mathrm{T}} W v$, where $v := \sqrt{\Sigma} z$. Whenever $z \in V(R)$, then the
vector $v$ must belong to the set $V'(R) := \{ v \in S^{d-1} \mid \|\Sigma^{-1/2} v\|_2 \le R \}$, and we have shown that


$$-\min_{z \in V(R)} \|Xz\|_2 = \max_{v \in V'(R)} \min_{u \in S^{n-1}} \underbrace{u^{\mathrm{T}} W v}_{Z_{u,v}}.$$


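Let $(u, v)$ and $(\tilde{u}, \tilde{v})$ be any two members of the Cartesian product space $S^{n-1} \times V'(R)$. Since
$\|u\|_2 = \|\tilde{u}\|_2 = \|v\|_2 = \|\tilde{v}\|_2 = 1$, following the same argument as in bounding the maximal
singular value shows that


$$\rho_Z^2((u, v), (\tilde{u}, \tilde{v})) \le \|u - \tilde{u}\|_2^2 + \|v - \tilde{v}\|_2^2, \qquad (6.66)$$


with equality holding when $v = \tilde{v}$. Consequently, if we define the Gaussian process $Y_{u,v} :=
\langle g, u \rangle + \langle h, v \rangle$, where $g \in \mathbb{R}^n$ and $h \in \mathbb{R}^d$ are standard Gaussian vectors and mutually
independent, then we have


$$\rho_Y^2((u, v), (\tilde{u}, \tilde{v})) = \|u - \tilde{u}\|_2^2 + \|v - \tilde{v}\|_2^2,$$


so that the Sudakov–Fernique increment condition (6.64) holds. In addition, for a pair such
that $v = \tilde{v}$, equality holds in the upper bound (6.66), which guarantees that $\rho_Z((u, v), (\tilde{u}, v)) =
\rho_Y((u, v), (\tilde{u}, v))$. Consequently, we may apply Gordon’s inequality (6.65) to conclude that


$$\begin{aligned}
\mathbb{E}\Big[ -\min_{z \in V(R)} \|Xz\|_2 \Big]
&\le \mathbb{E}\Big[ \max_{v \in V'(R)} \min_{u \in S^{n-1}} Y_{u,v} \Big] \\
&= \mathbb{E}\Big[ \min_{u \in S^{n-1}} \langle g, u \rangle \Big] + \mathbb{E}\Big[ \max_{v \in V'(R)} \langle h, v \rangle \Big] \\
&\le -\mathbb{E}[\|g\|_2] + \mathbb{E}[\|\sqrt{\Sigma} h\|_2] \, R,
\end{aligned}$$


where we have used the upper bound $|\langle h, v \rangle| = |\langle \sqrt{\Sigma} h, \Sigma^{-1/2} v \rangle| \le \|\sqrt{\Sigma} h\|_2 R$, by definition of
the set $V'(R)$.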
We now claim that


$$\frac{\mathbb{E}[\|\sqrt{\Sigma} h\|_2]}{\sqrt{\operatorname{tr}(\Sigma)}} \le \frac{\mathbb{E}[\|h\|_2]}{\sqrt{d}}. \qquad (6.67)$$


Indeed, by the rotation invariance of the Gaussian distribution, we may assume that $\Sigma$ is
diagonal, with non-negative entries $\{\gamma_j\}_{j=1}^d$, and the claim is equivalent to showing that the
function $F(\gamma) := \mathbb{E}\big[ \big( \sum_{j=1}^d \gamma_j h_j^2 \big)^{1/2} \big]$ achieves its maximum over the probability simplex at the
uniform vector (i.e., with all entries $\gamma_j = 1/d$). Since $F$ is continuous and the probability
simplex is compact, the maximum is achieved. By the rotation invariance of the Gaussian, the
function $F$ is also permutation invariant, i.e., $F(\gamma) = F(\Pi(\gamma))$ for all permutation matrices
$\Pi$. Since $F$ is also concave, the maximum must be achieved at $\gamma_j = 1/d$, which establishes
the inequality (6.67).
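
The claim that F is maximized at the uniform vector can be probed by simulation. The sketch below is an illustration only, assuming a diagonal covariance normalized to unit trace; the dimension, the three weight vectors, the sample size and the seed are hypothetical choices. It estimates F by Monte Carlo with Python's standard `random` module and compares a uniform weight vector with two non-uniform ones.

```python
import math
import random


def mean_weighted_norm(gammas, num_samples, rng):
    """Monte Carlo estimate of F(gamma) = E[(sum_j gamma_j h_j^2)^(1/2)], h standard Gaussian."""
    total = 0.0
    for _ in range(num_samples):
        total += math.sqrt(sum(g * rng.gauss(0.0, 1.0) ** 2 for g in gammas))
    return total / num_samples


rng = random.Random(0)  # fixed seed so that the run is reproducible
d = 4
uniform = [1.0 / d] * d             # centre of the probability simplex
skewed = [0.85, 0.05, 0.05, 0.05]   # most of the trace on one coordinate
vertex = [1.0, 0.0, 0.0, 0.0]       # a vertex of the simplex

f_uniform = mean_weighted_norm(uniform, 20000, rng)
f_skewed = mean_weighted_norm(skewed, 20000, rng)
f_vertex = mean_weighted_norm(vertex, 20000, rng)

# At a vertex, F reduces to E|h_1| = sqrt(2 / pi) for a standard Gaussian h_1.
print(f_uniform, f_skewed, f_vertex, math.sqrt(2.0 / math.pi))
```

A simulation of this kind cannot replace the concavity argument; it only shows the ordering that the argument predicts for these particular weight vectors.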
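Recalling that $R = 1/\bar{\sigma}_{\min}$, we then have


$$\begin{aligned}
-\mathbb{E}[\|g\|_2] + R \, \mathbb{E}[\|\sqrt{\Sigma} h\|_2]
&\le -\mathbb{E}[\|g\|_2] + \frac{\sqrt{\operatorname{tr}(\Sigma)}}{\bar{\sigma}_{\min}} \frac{\mathbb{E}[\|h\|_2]}{\sqrt{d}} \\
&= \underbrace{\big\{ -\mathbb{E}[\|g\|_2] + \mathbb{E}[\|h\|_2] \big\}}_{T_1}
 + \underbrace{\Big\{ \sqrt{\frac{\operatorname{tr}(\Sigma)}{\bar{\sigma}_{\min}^2 d}} - 1 \Big\} \mathbb{E}[\|h\|_2]}_{T_2}.
\end{aligned}$$


By Jensen’s inequality, we have $\mathbb{E}\|h\|_2 \le \sqrt{\mathbb{E}\|h\|_2^2} = \sqrt{d}$. Since $\frac{\operatorname{tr}(\Sigma)}{\bar{\sigma}_{\min}^2 d} \ge 1$, we conclude that
$T_2 \le \Big\{ \sqrt{\frac{\operatorname{tr}(\Sigma)}{\bar{\sigma}_{\min}^2 d}} - 1 \Big\} \sqrt{d}$. On the other hand, a direct calculation, using our assumption that
$n \ge d$, shows that $T_1 \le -\sqrt{n} + \sqrt{d}$. Combining the pieces, we conclude that


$$\mathbb{E}\Big[ -\min_{z \in V(R)} \|Xz\|_2 \Big]
\le -\sqrt{n} + \sqrt{d} + \Big\{ \sqrt{\frac{\operatorname{tr}(\Sigma)}{\bar{\sigma}_{\min}^2 d}} - 1 \Big\} \sqrt{d}
= -\sqrt{n} + \frac{\sqrt{\operatorname{tr}(\Sigma)}}{\bar{\sigma}_{\min}}.$$


Dividing both sides by $-\sqrt{n}$ and recalling that $R = 1/\bar{\sigma}_{\min}$ yields the bound (6.63),
which establishes the initial claim (6.62), thereby completing the proof.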
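
As a sanity check on the constants, it can help to specialize the bound (6.62) to the isotropic case. The worked instance below is an illustration rather than part of the proof: the choice $\Sigma = I_d$ and the numerical values $n = 100d$, $\delta = 0.1$ are assumptions made here.

```latex
% Specialization of (6.61)-(6.62) to \Sigma = I_d (illustrative choice).
\bar{\sigma}_{\min} = \sigma_{\min}(\sqrt{I_d}) = 1, \qquad
R = 1, \qquad
\operatorname{tr}(I_d) = d, \qquad
V(R) = S^{d-1},
\\[4pt]
\frac{\sigma_{\min}(X)}{\sqrt{n}} = \min_{z \in S^{d-1}} \frac{\|Xz\|_2}{\sqrt{n}}
\;\ge\; 1 - \delta - \sqrt{\frac{d}{n}}
\quad \text{with probability at least } 1 - e^{-n\delta^2/2},
\\[4pt]
% Numerical instance: n = 100 d and \delta = 0.1.
\frac{\sigma_{\min}(X)}{\sqrt{n}} \;\ge\; 1 - 0.1 - 0.1 = 0.8
\quad \text{with probability at least } 1 - e^{-n/200}.
```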


6.7 Bibliographic details and background 


The two-volume series by Horn and Johnson (1985; 1991) is a standard reference on linear 
algebra. A statement of Weyl’s theorem and its corollaries can be found in section 4.3 of the 
first volume (Horn and Johnson, 1985). The monograph by Bhatia (1997) is more advanced 
in nature, taking a functional-analytic perspective, and includes discussion of Lidskii’s the- 
orem (see section II.4). The notes by Carlen (2009) contain further background on trace 
inequalities, such as inequality (6.25). 

Some classical papers on asymptotic random matrix theory include those by Wigner (1955;
1958), Marčenko and Pastur (1967), Pastur (1972), Wachter (1978) and Geman (1980).
Mehta (1991) provides an overview of asymptotic random matrix theory, primarily from
the physicist’s perspective, whereas the book by Bai and Silverstein (2010) takes a more
statistical perspective. The lecture notes of Vershynin (2011) focus on the non-asymptotic
aspects of random matrix theory, as partially covered here. Davidson and Szarek (2001)
describe the use of Sudakov–Fernique (Slepian) and Gordon inequalities in bounding
expectations of random matrices; see also the earlier papers by Gordon (1985; 1986; 1987)
and Szarek (1991). The results in Davidson and Szarek (2001) are for the special case of the
standard Gaussian ensemble ($\Sigma = I_d$), but the underlying arguments are easily extended to
the general case, as given here.

The proof of Theorem 6.5 is based on the lecture notes of Vershynin (2011). The under- 
lying discretization argument is classical, used extensively in early work on random con- 
structions in Banach space geometry (e.g., see the book by Pisier (1989) and references 
therein). Note that this discretization argument is the one-step version of the more sophisti- 
cated chaining methods described in Chapter 5. 

Bounds on the expected operator norm of a random matrix follow a class of results known
as non-commutative Bernstein inequalities, as derived initially by Rudelson (1999). Ahlswede
and Winter (2002) developed techniques for matrix tail bounds based on controlling
the matrix moment generating function, and exploiting the Golden–Thompson inequality.
Other authors, among them Oliveira (2010), Gross (2011) and Recht (2011), developed various
extensions and refinements of the original Ahlswede–Winter approach. Tropp (2010)
introduced the idea of controlling the matrix generating function directly, and developed
the argument that underlies Lemma 6.13. Controlling the moment generating function in
this way leads to tail bounds involving the variance parameter $\sigma^2 := \frac{1}{n} |\!|\!| \sum_{i=1}^n \operatorname{var}(Q_i) |\!|\!|_2$
as opposed to the potentially larger quantity $\tilde{\sigma}^2 := \frac{1}{n} \sum_{i=1}^n |\!|\!| \operatorname{var}(Q_i) |\!|\!|_2$ that follows from the
original Ahlswede–Winter argument. By the triangle inequality for the operator norm, we
have $\sigma^2 \le \tilde{\sigma}^2$, and the latter quantity can be substantially larger. Independent work by
Oliveira (2010) also derived bounds involving the variance parameter $\sigma^2$, using a related
technique that sharpened the original Ahlswede–Winter approach. Tropp (2010) also provides
various extensions of the basic Bernstein bound, among them results for matrix martingales
as opposed to the independent random matrices considered here. Mackey et al. (2014)
show how to derive matrix concentration bounds with sharp constants using the method of
exchangeable pairs introduced by Chatterjee (2007). Matrix tail bounds with refined forms of
dimension dependence have been developed by various authors (Minsker, 2011; Hsu et al.,
2012a); the specific sharpening sketched out in Example 6.19 and Exercise 6.12 is due to
Minsker (2011).

For covariance estimation, Adamczak et al. (2010) provide sharp results on the deviation
$|\!|\!| \widehat{\Sigma} - \Sigma |\!|\!|_2$ for distributions with sub-exponential tails. These results remove the superfluous
logarithmic factor that arises from an application of Corollary 6.20 to a sub-exponential
ensemble. Srivastava and Vershynin (2013) give related results under very weak moment
conditions. For thresholded sample covariances, the first high-dimensional analyses were
undertaken in independent work by Bickel and Levina (2008a) and El Karoui (2008). Bickel
and Levina studied the problem under sub-Gaussian tail conditions, and introduced the
row-wise sparsity model, defined in terms of the maximum $\ell_q$-“norm” taken over the rows. By
contrast, El Karoui imposed a milder set of moment conditions, and measured sparsity in
terms of the growth rates of path lengths in the graph; this approach is essentially equivalent
to controlling the operator norm $|\!|\!| A |\!|\!|_2$ of the adjacency matrix, as in Theorem 6.23. The star
graph is an interesting example that illustrates the difference between the row-wise sparsity
model and the operator norm approach.

An alternative model for covariance matrices is a banded decay model, in which entries 
decay according to their distance from the diagonal. Bickel and Levina (2008b) introduced 
this model in the covariance setting, and proposed a certain kind of tapering estimator. Cai 
et al. (2010) analyzed the minimax-optimal rates associated with this class of covariance 
matrices, and provided a modified estimator that achieves these optimal rates. 


6.8 Exercises 


Exercise 6.1 (Bounds on eigenvalues) Given two symmetric matrices $A$ and $B$, show directly,
without citing any other theorems, that


$$|\gamma_{\max}(A) - \gamma_{\max}(B)| \le |\!|\!| A - B |\!|\!|_2 \quad \text{and} \quad |\gamma_{\min}(A) - \gamma_{\min}(B)| \le |\!|\!| A - B |\!|\!|_2.$$


Exercise 6.2 (Relations between matrix operator norms) For a rectangular matrix $A$ with
real entries and a scalar $q \in [1, \infty]$, the $(\ell_q \to \ell_q)$-operator norms are given by

$$|\!|\!| A |\!|\!|_q = \sup_{\|x\|_q = 1} \|Ax\|_q.$$

(a) Derive explicit expressions for the operator norms $|\!|\!| A |\!|\!|_2$, $|\!|\!| A |\!|\!|_1$ and $|\!|\!| A |\!|\!|_\infty$ in terms of
elements and/or singular values of $A$.
(b) Prove that $|\!|\!| AB |\!|\!|_q \le |\!|\!| A |\!|\!|_q \, |\!|\!| B |\!|\!|_q$ for any size-compatible matrices $A$ and $B$.
(c) For a square matrix $A$, prove that $|\!|\!| A |\!|\!|_2^2 \le |\!|\!| A |\!|\!|_1 \, |\!|\!| A |\!|\!|_\infty$. What happens when $A$ is
symmetric?


Exercise 6.3 (Non-negative matrices and operator norms) Given two $d$-dimensional symmetric
matrices $A$ and $B$, suppose that $0 \le A \le B$ in an elementwise sense (i.e., $0 \le A_{j\ell} \le B_{j\ell}$
for all $j, \ell = 1, \ldots, d$).


(a) Show that $0 \le A^m \le B^m$ for all integers $m = 1, 2, \ldots$.

(b) Use part (a) to show that $|\!|\!| A |\!|\!|_2 \le |\!|\!| B |\!|\!|_2$.

(c) Use a similar argument to show that $|\!|\!| C |\!|\!|_2 \le |\!|\!| \, |C| \, |\!|\!|_2$ for any symmetric matrix $C$, where
$|C|$ denotes the absolute value function applied elementwise.


Exercise 6.4 (Inequality for matrix exponential) Let $A \in S^{d \times d}$ be any symmetric matrix.
Show that $I_d + A \preceq e^A$. (Hint: First prove the statement for a diagonal matrix $A$, and then
show how to reduce to the diagonal case.)


Exercise 6.5 (Matrix monotone functions) A function $f \colon S_+^{d \times d} \to S_+^{d \times d}$ on the space of
symmetric positive semidefinite matrices is said to be matrix monotone if


$$f(A) \preceq f(B) \quad \text{whenever } A \preceq B.$$

Here $\preceq$ denotes the positive semidefinite ordering on $S_+^{d \times d}$.


(a) Show by counterexample that the function $f(A) = A^2$ is not matrix monotone. (Hint:
Note that $(A + tC)^2 = A^2 + t^2 C^2 + t(AC + CA)$, and search for a pair of positive semidefinite
matrices such that $AC + CA$ has a negative eigenvalue.)

(b) Show by counterexample that the matrix exponential function $f(A) = e^A$ is not matrix
monotone. (Hint: Part (a) could be useful.)

(c) Show that the matrix logarithm function $f(A) = \log A$ is matrix monotone on the cone
of strictly positive definite matrices. (Hint: You may use the fact that $g(A) = A^p$ is
matrix monotone for all $p \in [0, 1]$.)


Exercise 6.6 (Variance and positive semidefiniteness) Recall that the variance of a symmetric
random matrix $Q$ is given by $\operatorname{var}(Q) = \mathbb{E}[Q^2] - (\mathbb{E}[Q])^2$. Show that $\operatorname{var}(Q) \succeq 0$.


Exercise 6.7 (Sub-Gaussian random matrices) Consider the random matrix $Q = gB$, where
$g \in \mathbb{R}$ is a zero-mean $\sigma$-sub-Gaussian variable.


(a) Assume that $g$ has a distribution symmetric around zero, and $B \in S^{d \times d}$ is a deterministic
matrix. Show that $Q$ is sub-Gaussian with matrix parameter $V = c^2 \sigma^2 B^2$, for some
universal constant $c$.
(b) Now assume that $B \in S^{d \times d}$ is random and independent of $g$, with $|\!|\!| B |\!|\!|_2 \le b$ almost surely.
Prove that $Q$ is sub-Gaussian with matrix parameter given by $V = c^2 \sigma^2 b^2 I_d$.


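Exercise 6.8 (Sub-Gaussian matrices and mean bounds) Consider a sequence of independent,
zero-mean random matrices $\{Q_i\}_{i=1}^n$ in $S^{d \times d}$, each sub-Gaussian with matrix parameter
$V_i$. In this exercise, we provide bounds on the expected value of eigenvalues and operator
norm of $S_n = \frac{1}{n} \sum_{i=1}^n Q_i$.


(a) Show that $\mathbb{E}[\gamma_{\max}(S_n)] \le \sqrt{\frac{2 \sigma^2 \log d}{n}}$, where $\sigma^2 = |\!|\!| \frac{1}{n} \sum_{i=1}^n V_i |\!|\!|_2$.
(Hint: Start by showing that $\mathbb{E}[e^{\lambda \gamma_{\max}(S_n)}] \le d \, e^{\lambda^2 \sigma^2/(2n)}$.)


(b) Show that

$$\mathbb{E}\Big[ |\!|\!| \tfrac{1}{n} \sum_{i=1}^n Q_i |\!|\!|_2 \Big] \le \sqrt{\frac{2 \sigma^2 \log(2d)}{n}}. \qquad (6.68)$$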


Exercise 6.9 (Bounded matrices and Bernstein condition) Let $Q \in S^{d \times d}$ be an arbitrary
symmetric matrix.


(a) Show that the bound $|\!|\!| Q |\!|\!|_2 \le b$ implies that $Q^{j-2} \preceq b^{j-2} I_d$.


(b) Show that the positive semidefinite order is preserved under left-right multiplication,
meaning that if $A \preceq B$, then we also have $QAQ \preceq QBQ$ for any matrix $Q \in S^{d \times d}$.


(c) Use parts (a) and (b) to prove the inequality (6.30).


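Exercise 6.10 (Tail bounds for non-symmetric matrices) In this exercise, we prove that a
version of the tail bound (6.42) holds for general independent zero-mean matrices $\{A_i\}_{i=1}^n$
that are almost surely bounded as $|\!|\!| A_i |\!|\!|_2 \le b$, as long as we adopt the new definition (6.43)
of $\sigma^2$.


(a) Given a general matrix $A_i \in \mathbb{R}^{d_1 \times d_2}$, define a symmetric matrix of dimension $(d_1 + d_2)$
via

$$Q_i := \begin{bmatrix} 0_{d_1 \times d_1} & A_i \\ A_i^{\mathrm{T}} & 0_{d_2 \times d_2} \end{bmatrix}.$$

Prove that $|\!|\!| Q_i |\!|\!|_2 = |\!|\!| A_i |\!|\!|_2$.
(b) Prove that $|\!|\!| \frac{1}{n} \sum_{i=1}^n \operatorname{var}(Q_i) |\!|\!|_2 \le \sigma^2$ where $\sigma^2$ is defined in equation (6.43).
(c) Conclude that

$$\mathbb{P}\Big[ |\!|\!| \sum_{i=1}^n A_i |\!|\!|_2 \ge n\delta \Big] \le 2 (d_1 + d_2) \, e^{-\frac{n \delta^2}{2(\sigma^2 + b\delta)}}. \qquad (6.69)$$


Exercise 6.11 (Unbounded matrices and Bernstein bounds) Consider an independent sequence
of random matrices $\{A_i\}_{i=1}^n$ in $\mathbb{R}^{d_1 \times d_2}$, each of the form $A_i = g_i B_i$, where $g_i \in \mathbb{R}$ is
a zero-mean scalar random variable, and $B_i$ is an independent random matrix. Suppose that
$\mathbb{E}[g_i^j] \le \frac{j!}{2} b_1^{j-2} \sigma^2$ for $j = 2, 3, \ldots$, and that $|\!|\!| B_i |\!|\!|_2 \le b_2$ almost surely.


(a) For any $\delta > 0$, show that

$$\mathbb{P}\Big[ |\!|\!| \tfrac{1}{n} \sum_{i=1}^n A_i |\!|\!|_2 \ge \delta \Big] \le (d_1 + d_2) \, e^{-\frac{n \delta^2}{2(\sigma^2 b_2^2 + b_1 b_2 \delta)}}.$$

(Hint: The result of Exercise 6.10(a) could be useful.)
(b) Show that

$$\mathbb{E}\Big[ |\!|\!| \tfrac{1}{n} \sum_{i=1}^n A_i |\!|\!|_2 \Big] \le \frac{2 \sigma b_2}{\sqrt{n}} \big\{ \sqrt{\log(d_1 + d_2)} + \sqrt{\pi} \big\} + \frac{4 b_1 b_2}{n} \big\{ \log(d_1 + d_2) + 1 \big\}.$$

(Hint: The result of Exercise 2.8 could be useful.)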


Exercise 6.12 (Sharpened matrix Bernstein inequality) In this exercise, we work through
various steps of the calculation sketched in Example 6.19.


(a) Prove the bound (6.46).
(b) Show that for any symmetric zero-mean random matrix $Q$ such that $|\!|\!| Q |\!|\!|_2 \le 1$ almost
surely, the moment generating function is bounded as

$$\log \Psi_Q(\lambda) \preceq \underbrace{(e^{\lambda} - \lambda - 1)}_{\phi(\lambda)} \operatorname{var}(Q).$$

(c) Prove the upper bound (6.47b).


Exercise 6.13 (Bernstein’s inequality for vectors) In this exercise, we consider the problem
of obtaining a Bernstein-type bound on the random variable $\| \sum_{i=1}^n x_i \|_2$, where $\{x_i\}_{i=1}^n$ is an i.i.d.
sequence of zero-mean random vectors such that $\|x_i\|_2 \le 1$ almost surely, and $\operatorname{cov}(x_i) = \Sigma$.
In order to do so, we consider applying either Theorem 6.17 or the bound (6.48) to the
$(d + 1)$-dimensional symmetric matrices

$$Q_i := \begin{bmatrix} 0 & x_i^{\mathrm{T}} \\ x_i & 0_d \end{bmatrix}.$$

Define the matrix $V_n = \sum_{i=1}^n \operatorname{var}(Q_i)$.


(a) Show that the best bound obtainable from Theorem 6.17 will have a pre-factor of the
form $\operatorname{rank}(\Sigma) + 1$, which can be as large as $d + 1$.

(b) By way of contrast, show that the bound (6.48) yields a dimension-independent pre-factor
of 2.


Exercise 6.14 (Random packings) The goal of this exercise is to prove that there exists a
collection of vectors $P = \{\theta^1, \ldots, \theta^M\}$ belonging to the sphere $S^{d-1}$ such that:


(a) the set $P$ forms a 1/2-packing in the Euclidean norm;
(b) the set $P$ has cardinality $M \ge e^{c_0 d}$ for some universal constant $c_0$;
(c) the inequality $|\!|\!| \frac{1}{M} \sum_{j=1}^M (\theta^j \otimes \theta^j) |\!|\!|_2 \le \frac{2}{d}$ holds.


(Note: You may assume that $d$ is larger than some universal constant so as to avoid annoying
subcases.)


Exercise 6.15 (Estimation of diagonal covariances) Let $\{x_i\}_{i=1}^n$ be an i.i.d. sequence of
$d$-dimensional vectors, drawn from a zero-mean distribution with diagonal covariance matrix
$\Sigma = D$. Consider the estimate $\widehat{D} = \operatorname{diag}(\widehat{\Sigma})$, where $\widehat{\Sigma}$ is the usual sample covariance matrix.


(a) When each vector $x_i$ is sub-Gaussian with parameter at most $\sigma$, show that there are
universal positive constants $c_j$ such that

$$\mathbb{P}\Big[ |\!|\!| \widehat{D} - D |\!|\!|_2 / \sigma^2 \ge c_0 \sqrt{\frac{\log d}{n}} + \delta \Big] \le c_1 e^{-c_2 n \min\{\delta, \delta^2\}} \quad \text{for all } \delta > 0.$$

(b) Instead of a sub-Gaussian tail condition, suppose that for some even integer $m \ge 2$, there
is a universal constant $K_m$ such that

$$\underbrace{\mathbb{E}[(x_{ij}^2 - \Sigma_{jj})^m]}_{\|x_{ij}^2 - \Sigma_{jj}\|_m^m} \le K_m \quad \text{for each } i = 1, \ldots, n \text{ and } j = 1, \ldots, d.$$

Show that

$$\mathbb{P}\Big[ |\!|\!| \widehat{D} - D |\!|\!|_2 \ge 4 \delta \sqrt{\frac{d^{2/m}}{n}} \Big] \le K'_m \Big( \frac{1}{2\delta} \Big)^m \quad \text{for all } \delta > 0,$$

where $K'_m$ is another universal constant.
Hint: You may find Rosenthal’s inequality useful: given zero-mean independent random
variables $Z_i$ such that $\|Z_i\|_m < +\infty$, there is a universal constant $C_m$ such that

$$\Big\| \sum_{i=1}^n Z_i \Big\|_m \le C_m \Big\{ \Big( \sum_{i=1}^n \mathbb{E}[Z_i^2] \Big)^{1/2} + \Big( \sum_{i=1}^n \mathbb{E}[|Z_i|^m] \Big)^{1/m} \Big\}.$$


Exercise 6.16 (Graphs and adjacency matrices) Let $G$ be a graph with maximum degree
$s - 1$ that contains an $s$-clique. Letting $A$ denote its adjacency matrix (defined with ones on
the diagonal), show that $|\!|\!| A |\!|\!|_2 = s$.


7


Sparse linear models in high dimensions


The linear model is one of the most widely used in statistics, and has a history dating back to
the work of Gauss on least-squares prediction. In its low-dimensional instantiation, in which
the number of predictors $d$ is substantially less than the sample size $n$, the associated theory
is classical. By contrast, our aim in this chapter is to develop theory that is applicable to the
high-dimensional regime, meaning that it allows for scalings such that $d \asymp n$, or even $d \gg n$.
As one might intuitively expect, if the model lacks any additional structure, then there is no
hope of obtaining consistent estimators when the ratio $d/n$ stays bounded away from zero.¹
For this reason, when working in settings in which $d > n$, it is necessary to impose additional
structure on the unknown regression vector $\theta^* \in \mathbb{R}^d$, and this chapter focuses on different
types of sparse models.


7.1 Problem formulation and applications 


Let $\theta^* \in \mathbb{R}^d$ be an unknown vector, referred to as the regression vector. Suppose that we
observe a vector $y \in \mathbb{R}^n$ and a matrix $X \in \mathbb{R}^{n \times d}$ that are linked via the standard linear model


$$y = X\theta^* + w, \qquad (7.1)$$


where $w \in \mathbb{R}^n$ is a vector of noise variables. This model can also be written in a scalarized
form: for each index $i = 1, 2, \ldots, n$, we have $y_i = \langle x_i, \theta^* \rangle + w_i$, where $x_i \in \mathbb{R}^d$ is the $i$th
row of $X$, and $y_i$ and $w_i$ are (respectively) the $i$th entries of the vectors $y$ and $w$. The quantity
$\langle x_i, \theta^* \rangle := \sum_{j=1}^d x_{ij} \theta^*_j$ denotes the usual Euclidean inner product between the vector $x_i \in \mathbb{R}^d$
of predictors (or covariates), and the regression vector $\theta^* \in \mathbb{R}^d$. Thus, each response $y_i$ is a
noisy version of a linear combination of $d$ covariates.

The focus of this chapter is settings in which the sample size $n$ is smaller than the number
of predictors $d$. In this case, it can also be of interest in certain applications to consider a
noiseless linear model, meaning the special case of equation (7.1) with $w = 0$. When $n < d$,
the equations $y = X\theta^*$ define an underdetermined linear system, and the goal is to understand
the structure of its sparse solutions.


7.1.1 Different sparsity models 


At the same time, when $d > n$, it is impossible to obtain any meaningful estimates of $\theta^*$
unless the model is equipped with some form of low-dimensional structure. One of the
simplest kinds of structure in a linear model is a hard sparsity assumption, meaning that the
set


$$S(\theta^*) := \{ j \in \{1, 2, \ldots, d\} \mid \theta^*_j \ne 0 \}, \qquad (7.2)$$


known as the support set of $\theta^*$, has cardinality $s := |S(\theta^*)|$ substantially smaller than $d$.
Assuming that the model is exactly supported on $s$ coefficients may be overly restrictive,
in which case it is also useful to consider various relaxations of hard sparsity, which leads
to the notion of weak sparsity. Roughly speaking, a vector $\theta^*$ is weakly sparse if it can be
closely approximated by a sparse vector.


¹ Indeed, this intuition will be formalized as a theorem in Chapter 15 using information-theoretic methods.


Figure 7.1 Illustrations of the $\ell_q$-“balls” for different choices of the parameter $q \in
(0, 1]$. (a) For $q = 1$, the set $\mathbb{B}_1(R_q)$ corresponds to the usual $\ell_1$-ball shown here.
(b) For $q = 0.75$, the ball is a non-convex set obtained by collapsing the faces of
the $\ell_1$-ball towards the origin. (c) For $q = 0.5$, the set becomes more “spiky”, and it
collapses into the hard sparsity constraint as $q \to 0^+$. As shown in Exercise 7.2(a),
for all $q \in (0, 1]$, the set $\mathbb{B}_q(1)$ is star-shaped around the origin.

There are different ways in which to formalize such an idea, one way being via the $\ell_q$-“norms”.
For a parameter $q \in [0, 1]$ and radius $R_q > 0$, consider the set


$$\mathbb{B}_q(R_q) = \Big\{ \theta \in \mathbb{R}^d \ \Big| \ \sum_{j=1}^d |\theta_j|^q \le R_q \Big\}. \qquad (7.3)$$


It is known as the $\ell_q$-ball of radius $R_q$. As illustrated in Figure 7.1, for $q \in [0, 1)$, it is not a
ball in the strict sense of the word, since it is a non-convex set. In the special case $q = 0$, any
vector $\theta^* \in \mathbb{B}_0(R_0)$ can have at most $s = R_0$ non-zero entries. More generally, for values of
$q$ in $(0, 1]$, membership in the set $\mathbb{B}_q(R_q)$ has different interpretations. One of them involves
how quickly the ordered coefficients


$$\underbrace{|\theta^*_{(1)}|}_{\max_{j=1,2,\ldots,d} |\theta^*_j|} \ge |\theta^*_{(2)}| \ge \cdots \ge |\theta^*_{(d-1)}| \ge \underbrace{|\theta^*_{(d)}|}_{\min_{j=1,2,\ldots,d} |\theta^*_j|} \qquad (7.4)$$


decay. More precisely, as we explore in Exercise 7.2, if these ordered coefficients satisfy
the bound $|\theta^*_{(j)}| \le C j^{-\alpha}$ for a suitable exponent $\alpha$, then $\theta^*$ belongs to $\mathbb{B}_q(R_q)$ for a radius $R_q$
depending on $(C, \alpha)$.
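
To make the definition (7.3) concrete, the following sketch computes the $\ell_q$ "mass" of a vector and tests membership in an $\ell_q$-ball. It is an illustration under stated assumptions: the function names `lq_mass` and `in_lq_ball`, the decay parameters, and the particular radius (obtained from a standard integral comparison, valid when alpha * q > 1) are choices made here and are not taken from the text or from Exercise 7.2.

```python
def lq_mass(theta, q):
    """Return sum_j |theta_j|^q for q in (0, 1]; for q = 0, count the non-zero entries."""
    if q == 0:
        return sum(1 for t in theta if t != 0)
    return sum(abs(t) ** q for t in theta)


def in_lq_ball(theta, q, radius):
    """Membership test for the l_q-"ball" of the given radius, as in (7.3)."""
    return lq_mass(theta, q) <= radius


# Polynomially decaying coefficients |theta_(j)| = C * j^(-alpha) (hypothetical values).
C, alpha, q, d = 1.0, 3.0, 0.5, 1000
theta = [C * j ** (-alpha) for j in range(1, d + 1)]

# When alpha * q > 1, comparing the sum with an integral gives
#   sum_{j >= 1} j^(-alpha q) <= 1 + 1 / (alpha q - 1),
# so one admissible radius is C^q * (1 + 1 / (alpha q - 1)); it depends only on (C, alpha, q).
radius = C ** q * (1.0 + 1.0 / (alpha * q - 1.0))

print(lq_mass(theta, q), radius, in_lq_ball(theta, q, radius))
print(lq_mass([0.0, 2.0, 0.0, -1.0], 0))  # hard sparsity: number of non-zeros
```

The radius computed here does not depend on the dimension d, which is the sense in which fast coefficient decay gives a dimension-free form of weak sparsity.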


7.1.2 Applications of sparse linear models 


Although quite simple in appearance, the high-dimensional linear model is fairly rich. We 
illustrate it here with some examples and applications. 


Example 7.1 (Gaussian sequence model) In a finite-dimensional version of the Gaussian
sequence model, we make observations of the form


$$y_i = \sqrt{n} \, \theta^*_i + w_i \quad \text{for } i = 1, 2, \ldots, n, \qquad (7.5)$$


where $w_i \sim N(0, \sigma^2)$ are i.i.d. noise variables. This model is a special case of the general
linear regression model (7.1) with $n = d$, and a design matrix $X = \sqrt{n} I_n$. It is a truly
high-dimensional model, since the sample size $n$ is equal to the number of parameters $d$. Although
it appears simple on the surface, it is a surprisingly rich model: indeed, many problems in
nonparametric estimation, among them regression and density estimation, can be reduced
to an “equivalent” instance of the Gaussian sequence model, in the sense that the optimal
rates for estimation are the same under both models. For nonparametric regression, when
the function $f$ belongs to a certain type of function class (known as a Besov space), then the
vector of its wavelet coefficients belongs to a certain type of $\ell_q$-ball with $q \in (0, 1)$, so that
the estimation problem corresponds to a version of the Gaussian sequence problem with an
$\ell_q$-sparsity constraint. Various methods for estimation, such as wavelet thresholding, exploit
this type of approximate sparsity. See the bibliographic section for additional references on
this connection. ♣


Example 7.2 (Signal denoising in orthonormal bases) Sparsity plays an important role
in signal processing, both for compression and for denoising of signals. In abstract terms, a
signal can be represented as a vector $\beta^* \in \mathbb{R}^d$. Depending on the application, the signal length
$d$ could represent the number of pixels in an image, or the number of discrete samples of a
time series. In a denoising problem, one makes noisy observations of the form $\tilde{y} = \beta^* + \tilde{w}$,
where the vector $\tilde{w}$ corresponds to some kind of additive noise. Based on the observation
vector $\tilde{y} \in \mathbb{R}^d$, the goal is to “denoise” the signal, meaning to reconstruct $\beta^*$ as accurately
as possible. In a compression problem, the goal is to produce a representation of $\beta^*$, either
exact or approximate, that can be stored more compactly than its original representation.
Many classes of signals exhibit sparsity when transformed into an appropriate basis, such
as a wavelet basis. This sparsity can be exploited both for compression and for denoising. In
abstract terms, any such transform can be represented as an orthonormal matrix $\Psi \in \mathbb{R}^{d \times d}$,
constructed so that $\theta^* := \Psi^{\mathrm{T}} \beta^* \in \mathbb{R}^d$ corresponds to the vector of transform coefficients.
If the vector $\theta^*$ is known to be sparse, then it can be compressed by retaining only some
number $s < d$ of its coefficients, say the largest $s$ in absolute value. Of course, if $\theta^*$ were
exactly sparse, then this representation would be exact. It is more realistic to assume that
$\theta^*$ satisfies some form of approximate sparsity, and, as we explore in Exercise 7.2, such
conditions can be used to provide guarantees on the accuracy of the reconstruction.
Returning to the denoising problem, in the transformed space, the observation model takes
the form $y = \theta^* + w$, where $y := \Psi^{\mathrm{T}} \tilde{y}$ and $w := \Psi^{\mathrm{T}} \tilde{w}$ are the transformed observation and
noise vector, respectively. When the observation noise is assumed to be i.i.d. Gaussian (and
hence invariant under orthogonal transformation), then both the original and the transformed
observations are instances of the Gaussian sequence model from Example 7.1, both with
$n = d$.

If the vector $\theta^*$ is known to be sparse, then it is natural to consider estimators based on
thresholding. In particular, for a threshold $\lambda > 0$ to be chosen, the hard-thresholded estimate
of $\theta^*$ is defined as


$$H_\lambda(y_i) = \begin{cases} y_i & \text{if } |y_i| \ge \lambda, \\ 0 & \text{otherwise.} \end{cases} \qquad (7.6a)$$

Closely related is the soft-thresholded estimate given by

$$T_\lambda(y_i) = \begin{cases} \operatorname{sign}(y_i)(|y_i| - \lambda) & \text{if } |y_i| \ge \lambda, \\ 0 & \text{otherwise.} \end{cases} \qquad (7.6b)$$


As we explore in Exercise 7.1, each of these estimators has an interpretation as minimizing
the quadratic cost function $\theta \mapsto \|y - \theta\|_2^2$ subject to $\ell_0$- and $\ell_1$-constraints, respectively. ♣
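
The two thresholding rules are simple enough to state in a few lines of code. The sketch below is an illustration, with function names and test data chosen here; it follows the convention that an entry is retained when its absolute value is at least the threshold, so an entry exactly at the threshold is kept by the hard rule and mapped to zero by the soft rule.

```python
import math


def hard_threshold(y, lam):
    """Hard thresholding: keep y_i unchanged when |y_i| >= lam, otherwise set it to 0."""
    return [yi if abs(yi) >= lam else 0.0 for yi in y]


def soft_threshold(y, lam):
    """Soft thresholding: shrink |y_i| by lam towards zero, and set small entries to 0."""
    return [math.copysign(abs(yi) - lam, yi) if abs(yi) >= lam else 0.0 for yi in y]


# Hypothetical noisy observations of a sparse vector with two large coefficients.
y = [3.0, -0.4, 0.9, -2.5, 0.1]
lam = 1.0

print(hard_threshold(y, lam))  # large entries survive unchanged
print(soft_threshold(y, lam))  # large entries survive, shrunk by lam
```

Both rules return the same support on this input; they differ only in the bias that soft thresholding introduces on the retained coefficients.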


Example 7.3 (Lifting and nonlinear functions) Although the standard equation (7.1) appears
to represent purely linear functions, augmenting the set of predictors allows nonlinear models
to be represented as well. As an example, let us consider polynomial
functions in a scalar variable $t \in \mathbb{R}$ of degree $k$, say of the form


$$f_\theta(t) = \theta_1 + \theta_2 t + \cdots + \theta_{k+1} t^k.$$


Suppose that we observe $n$ samples of the form $\{(y_i, t_i)\}_{i=1}^n$, where each pair is linked via the
observation model $y_i = f_\theta(t_i) + w_i$. This problem can be converted into an instance of the
linear regression model by using the sample points $(t_1, \ldots, t_n)$ to define the $n \times (k + 1)$ matrix

$$X = \begin{bmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^k \\
1 & t_2 & t_2^2 & \cdots & t_2^k \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & t_n & t_n^2 & \cdots & t_n^k
\end{bmatrix}.$$


When expressed in this lifted space, the polynomial functions are linear in $\theta$, and so we can
write the observations $\{(y_i, t_i)\}_{i=1}^n$ in the standard vector form $y = X\theta + w$.

This lifting procedure is not limited to polynomial functions. The more general setting is
to consider functions that are linear combinations of some set of basis functions, say of the
form


$$f_\theta(t) = \sum_{j=1}^b \theta_j \phi_j(t),$$


where $\{\phi_1, \ldots, \phi_b\}$ are some known functions. Given $n$ observation pairs $(y_i, t_i)$, this model
can also be reduced to the form $y = X\theta + w$, where the design matrix $X \in \mathbb{R}^{n \times b}$ has entries
$X_{ij} = \phi_j(t_i)$.
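
The lifting construction can be sketched as follows. This is an illustration only: the helper names `polynomial_design`, `basis_design` and `predict`, the sample points, the coefficient vector and the basis functions are all hypothetical choices made here. The polynomial design matrix has rows (1, t_i, t_i^2, ..., t_i^k), and the general one has entries phi_j(t_i).

```python
import math


def polynomial_design(ts, k):
    """n x (k + 1) matrix with rows (1, t_i, t_i^2, ..., t_i^k)."""
    return [[t ** p for p in range(k + 1)] for t in ts]


def basis_design(ts, basis):
    """n x b matrix with entries X_ij = phi_j(t_i) for a list of basis functions."""
    return [[phi(t) for phi in basis] for t in ts]


def predict(X, theta):
    """Noiseless responses X theta, computed row by row."""
    return [sum(x * th for x, th in zip(row, theta)) for row in X]


ts = [0.0, 1.0, 2.0]
theta = [1.0, -2.0, 0.5]  # f(t) = 1 - 2 t + 0.5 t^2

X_poly = polynomial_design(ts, 2)
print(X_poly)
print(predict(X_poly, theta))  # values of the polynomial at the sample points

# The same recipe with a non-polynomial basis (constant, sine, cosine).
X_trig = basis_design(ts, [lambda t: 1.0, math.sin, math.cos])
print(X_trig[0])
```

In both cases the model is nonlinear in t but linear in theta, which is what allows it to be written in the form (7.1).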


Although the preceding discussion has focused on univariate functions, the same ideas
apply to multivariate functions, say in $D$ dimensions. Returning to the case of polynomial
functions, we note that there are $\binom{D}{k}$ possible multinomials of degree $k$ in dimension $D$. This
leads to the model dimension growing exponentially as $D^k$, so that sparsity assumptions
become essential in order to produce manageable classes of models. ♣


Example 7.4 (Signal compression in overcomplete bases) We now return to an extension
of the signal processing problem introduced in Example 7.2. As we observed previously,
many classes of signals exhibit sparsity when represented in an appropriate basis, such as
a wavelet basis, and this sparsity can be exploited for both compression and denoising
purposes. Given a signal $y \in \mathbb{R}^n$, classical approaches to signal denoising and compression
are based on orthogonal transformations, where the basis functions are represented by the
columns of an orthonormal matrix $\Psi \in \mathbb{R}^{n \times n}$. However, it can be useful to consider an
overcomplete set of basis functions, represented by the columns of a matrix $X \in \mathbb{R}^{n \times d}$ with $d > n$.
Within this framework, signal compression can be performed by finding a vector $\theta \in \mathbb{R}^d$ such
that $y = X\theta$. Since $X$ has rank $n$, we can always find a solution with at most $n$ non-zero
coordinates, but the hope is to find a solution $\theta^* \in \mathbb{R}^d$ with $\|\theta^*\|_0 = s \ll n$ non-zeros.

Problems involving $\ell_0$-constraints are computationally intractable, so that it is natural to
consider relaxations. As we will discuss at more length later in the chapter, the $\ell_1$-relaxation
has proven very successful. In particular, one seeks a sparse solution by solving the convex
program


$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \underbrace{\sum_{j=1}^d |\theta_j|}_{\|\theta\|_1} \quad \text{such that } y = X\theta.$$


Later sections of the chapter will provide theory under which the solution to this $\ell_1$-relaxation
is equivalent to the original $\ell_0$-problem. ♣
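
One reason the $\ell_1$-relaxation is computationally attractive is that it can be rewritten as a linear program. The reformulation below is the standard one, stated here as a supplementary sketch; the auxiliary vector $t \in \mathbb{R}^d$ is a symbol introduced for this illustration and does not appear in the text.

```latex
% Basis pursuit as a linear program in the variables (\theta, t) \in \mathbb{R}^d \times \mathbb{R}^d.
\min_{\theta \in \mathbb{R}^d, \; t \in \mathbb{R}^d} \; \sum_{j=1}^d t_j
\quad \text{such that} \quad
y = X\theta
\quad \text{and} \quad
-t_j \le \theta_j \le t_j \;\; \text{for } j = 1, \ldots, d.
% At any optimum, t_j = |\theta_j| for each j, so the optimal value equals the minimal \|\theta\|_1.
```

The objective and all constraints are linear in $(\theta, t)$, so a generic linear programming solver can be applied.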


Example 7.5 (Compressed sensing) Compressed sensing is based on the combination of
$\ell_1$-relaxation with the random projection method, which was previously described in
Example 2.12 from Chapter 2. It is motivated by the inherent wastefulness of the classical
approach to exploiting sparsity for signal compression. As previously described in
Example 7.2, given a signal $\beta^* \in \mathbb{R}^d$, the standard approach is first to compute the full vector
$\theta^* = \Psi^{\mathrm{T}} \beta^* \in \mathbb{R}^d$ of transform coefficients, and then to discard all but the top $s$ coefficients.
Is there a more direct way of estimating $\beta^*$, without pre-computing the full vector $\theta^*$ of its
transform coefficients?

The compressed sensing approach is to take $n \ll d$ random projections of the original
signal $\beta^* \in \mathbb{R}^d$, each of the form $y_i = \langle x_i, \beta^* \rangle := \sum_{j=1}^d x_{ij} \beta^*_j$, where $x_i \in \mathbb{R}^d$ is a random vector.
Various choices are possible, including the standard Gaussian ensemble ($x_{ij} \sim N(0, 1)$,
i.i.d.), or the Rademacher ensemble ($x_{ij} \in \{-1, +1\}$, i.i.d.). Let $X \in \mathbb{R}^{n \times d}$ be a measurement
matrix with $x_i^{\mathrm{T}}$ as its $i$th row and $y \in \mathbb{R}^n$ be the concatenated set of random projections. In
matrix–vector notation, the problem of exact reconstruction amounts to finding a solution
$\beta \in \mathbb{R}^d$ of the underdetermined linear system $X\beta = X\beta^*$ such that $\Psi^{\mathrm{T}} \beta$ is as sparse as
possible. Recalling that $y = X\beta^*$, the standard $\ell_1$-relaxation of this problem takes the form
$\min_{\beta \in \mathbb{R}^d} \|\Psi^{\mathrm{T}} \beta\|_1$ such that $y = X\beta$, or equivalently, in the transform domain,


$$\min_{\theta \in \mathbb{R}^d} \|\theta\|_1 \quad \text{such that } y = \widetilde{X}\theta, \qquad (7.7)$$

where $\widetilde{X} := X\Psi$. In asserting this equivalence, we have used the orthogonality relation
$\Psi \Psi^{\mathrm{T}} = I_d$. This is another instance of the basis pursuit linear program (LP) with a random
design matrix $\widetilde{X}$.

Compressed sensing is a popular approach to recovering sparse signals, with a number
of applications. Later in the chapter, we will develop theory that guarantees the success of
$\ell_1$-relaxation for the random design matrices that arise from taking random projections. ♣


Example 7.6 (Selection of Gaussian graphical models) Any zero-mean Gaussian random
vector $(Z_1, \ldots, Z_d)$ with a non-degenerate covariance matrix has a density of the form

$$
\mathbb{P}_{\Theta^*}(z_1, \ldots, z_d) = \frac{1}{\sqrt{(2\pi)^d \det((\Theta^*)^{-1})}} \exp\left(-\tfrac{1}{2} z^T \Theta^* z\right),
$$

where $\Theta^* \in \mathbb{R}^{d \times d}$ is the inverse covariance matrix, also known as the precision matrix.
For many interesting models, the precision matrix is sparse, with relatively few non-zero
entries. The problem of Gaussian graphical model selection, as discussed at more length in
Chapter 11, is to infer the non-zero entries in the matrix $\Theta^*$.

This problem can be reduced to an instance of sparse linear regression as follows. For a
given index $s \in V := \{1, 2, \ldots, d\}$, suppose that we are interested in recovering its neighborhood,
meaning the subset $N(s) := \{t \in V \mid \Theta^*_{st} \neq 0\}$. In order to do so, imagine performing a
linear regression of the variable $Z_s$ on the $(d-1)$-dimensional vector $Z_{\setminus\{s\}} := \{Z_t, \; t \in V \setminus \{s\}\}$.
As we explore in Exercise 11.3 in Chapter 11, we can write

$$
\underbrace{Z_s}_{\text{response } y} = \big\langle \underbrace{Z_{\setminus\{s\}}}_{\text{predictors}}, \; \theta^* \big\rangle + w_s,
$$

where $w_s$ is a zero-mean Gaussian variable, independent of the vector $Z_{\setminus\{s\}}$. Moreover, the
vector $\theta^* \in \mathbb{R}^{d-1}$ has the same sparsity pattern as the $s$th off-diagonal row $(\Theta^*_{st}, \; t \in V \setminus \{s\})$
of the precision matrix. ♣


7.2 Recovery in the noiseless setting 


In order to build intuition, we begin by focusing on the simplest case in which the
observations are perfect or noiseless. More concretely, we wish to find a solution $\theta$ to the linear
system $y = X\theta$, where $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times d}$ are given. When $d > n$, this is an underdetermined
set of linear equations, so that there is a whole subspace of solutions. But what if we
are told that there is a sparse solution? In this case, we know that there is some vector $\theta^* \in \mathbb{R}^d$
with at most $s \ll d$ non-zero entries such that $y = X\theta^*$. Our goal is to find this sparse
solution to the linear system. This noiseless problem has applications in signal representation
and compression, as discussed in Examples 7.4 and 7.5.

7.2.1 $\ell_1$-based relaxation

This problem can be cast as a (non-convex) optimization problem involving the $\ell_0$-“norm”.
Let us define

$$
\|\theta\|_0 := \sum_{j=1}^d \mathbb{I}[\theta_j \neq 0],
$$

where the function $t \mapsto \mathbb{I}[t \neq 0]$ is equal to one if $t \neq 0$, and zero otherwise. Strictly speaking,
this is not a norm, but it serves to count the number of non-zero entries in the vector $\theta \in \mathbb{R}^d$.
We now consider the optimization problem

$$
\min_{\theta \in \mathbb{R}^d} \|\theta\|_0 \quad \text{such that } X\theta = y. \tag{7.8}
$$


If we could solve this problem, then we would obtain a solution to the linear equations that
has the fewest number of non-zero entries.

But how to solve the problem (7.8)? Although the constraint set is simply a subspace,
the cost function is non-differentiable and non-convex. The most direct approach would
be to search exhaustively over subsets of the columns of $X$. In particular, for each subset
$S \subset \{1, \ldots, d\}$, we could form the matrix $X_S \in \mathbb{R}^{n \times |S|}$ consisting of the columns of $X$ indexed
by $S$, and then examine the linear system $y = X_S \theta$ to see whether or not it had a solution
$\theta \in \mathbb{R}^{|S|}$. If we iterated over subsets in increasing cardinality, then the first solution found
would be the sparsest solution. Let’s now consider the associated computational cost. If the
sparsest solution contained $s$ non-zero entries, then we would have to search over at least
$\sum_{j=1}^{s-1} \binom{d}{j}$ subsets before finding it. But the number of such subsets grows exponentially in $s$,
so the procedure would not be computationally feasible for anything except toy problems.
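The exhaustive search just described can be sketched in a few lines. This is only an illustration on a toy instance: the function name `l0_exhaustive`, the tolerance, and the problem sizes are assumptions made for the example, not part of the text.

```python
from itertools import combinations
from math import comb

import numpy as np


def l0_exhaustive(X, y, tol=1e-9):
    """Return (support, theta) for a sparsest solution of X @ theta = y, or None."""
    n, d = X.shape
    if np.linalg.norm(y) <= tol:              # the zero vector is the sparsest solution
        return (), np.zeros(d)
    for k in range(1, d + 1):                 # subsets in increasing cardinality
        for S in combinations(range(d), k):
            cols = list(S)
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            if np.linalg.norm(X[:, cols] @ coef - y) <= tol:
                theta = np.zeros(d)
                theta[cols] = coef
                return S, theta               # first hit has the smallest support size
    return None


# Toy instance (hypothetical sizes): a 2-sparse vector in d = 12 dimensions.
rng = np.random.default_rng(0)
n, d, s = 6, 12, 2
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[[3, 7]] = [1.5, -2.0]
y = X @ theta_star

support, theta_hat = l0_exhaustive(X, y)
# Number of subsets of size 1, ..., s - 1 that are ruled out before size s is reached.
subsets_before = sum(comb(d, j) for j in range(1, s))
```

Even at these sizes the loop solves one least-squares problem per subset, which is what makes the approach impractical beyond toy problems.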

Given the computational difficulties associated with $\ell_0$-minimization, a natural strategy
is to replace the troublesome $\ell_0$-objective by the nearest convex member of the $\ell_q$-family,
namely the $\ell_1$-norm. This is an instance of a convex relaxation, in which a non-convex
optimization problem is approximated by a convex program. In this setting, doing so leads to
the optimization problem

$$
\min_{\theta \in \mathbb{R}^d} \|\theta\|_1 \quad \text{such that } X\theta = y. \tag{7.9}
$$

Unlike the $\ell_0$-version, this is now a convex program, since the constraint set is a subspace
(hence convex), and the cost function is piecewise linear and thus convex as well. More
precisely, the problem (7.9) is a linear program, since any piecewise linear convex cost can
always be reformulated as the maximum of a collection of linear functions. We refer to the
optimization problem (7.9) as the basis pursuit linear program, after Chen, Donoho and
Saunders (1998).
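As a rough illustration of how (7.9) can be handed to an off-the-shelf LP solver, the sketch below uses the standard split $\theta = u - v$ with $u, v \ge 0$. This is a different but equivalent reformulation from the max-of-linear-functions one mentioned above. The helper name `basis_pursuit` and the problem sizes are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(X, y):
    """Solve min ||theta||_1 subject to X @ theta = y as a linear program."""
    n, d = X.shape
    # Split theta = u - v with u, v >= 0; at an optimum, sum(u) + sum(v) = ||theta||_1.
    cost = np.ones(2 * d)
    A_eq = np.hstack([X, -X])
    result = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    u, v = result.x[:d], result.x[d:]
    return u - v


# Hypothetical noiseless instance with a standard Gaussian design.
rng = np.random.default_rng(0)
n, d, s = 40, 100, 5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.choice([-1.0, 1.0], size=s)
y = X @ theta_star

theta_hat = basis_pursuit(X, y)
recovery_error = np.linalg.norm(theta_hat - theta_star, ord=np.inf)
```

For a design of this kind one would expect `recovery_error` to be at the level of solver tolerance, although exact recovery is only guaranteed under conditions such as those developed in the following subsections.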


7.2.2 Exact recovery and restricted nullspace 


We now turn to an interesting theoretical question: when is solving the basis pursuit
program (7.9) equivalent to solving the original $\ell_0$-problem (7.8)? More concretely, let us
suppose that there is a vector $\theta^* \in \mathbb{R}^d$ such that $y = X\theta^*$, and moreover, the vector $\theta^*$ has support
$S \subset \{1, 2, \ldots, d\}$, meaning that $\theta^*_j = 0$ for all $j \in S^c$ (where $S^c$ denotes the complement
of $S$).

Intuitively, the success of basis pursuit should depend on how the nullspace of $X$ is related
to this support, as well as the geometry of the $\ell_1$-ball. To make this concrete, recall that the
nullspace of $X$ is given by $\mathrm{null}(X) := \{\Delta \in \mathbb{R}^d \mid X\Delta = 0\}$. Since $X\theta^* = y$ by assumption,
any vector of the form $\theta^* + \Delta$ for some $\Delta \in \mathrm{null}(X)$ is feasible for the basis pursuit program.
Now let us consider the tangent cone of the $\ell_1$-ball at $\theta^*$, given by

$$
\mathbb{T}(\theta^*) = \{\Delta \in \mathbb{R}^d \mid \|\theta^* + t\Delta\|_1 \le \|\theta^*\|_1 \text{ for some } t > 0\}. \tag{7.10}
$$

As illustrated in Figure 7.2, this set captures the set of all directions relative to $\theta^*$ along which
the $\ell_1$-norm remains constant or decreases. As noted earlier, the set $\theta^* + \mathrm{null}(X)$, drawn with
a solid line in Figure 7.2, corresponds to the set of all vectors that are feasible for the basis
pursuit LP. Consequently, if $\theta^*$ is the unique optimal solution of the basis pursuit LP, then it
must be the case that the intersection of the nullspace $\mathrm{null}(X)$ with this tangent cone contains
only the zero vector. This favorable case is shown in Figure 7.2(a), whereas Figure 7.2(b)
shows the non-favorable case, in which $\theta^*$ need not be optimal.


Figure 7.2 Geometry of the tangent cone and restricted nullspace property in $d = 2$
dimensions. (a) The favorable case in which the set $\theta^* + \mathrm{null}(X)$ intersects the tangent
cone only at $\theta^*$. (b) The unfavorable setting in which the set $\theta^* + \mathrm{null}(X)$ passes
directly through the tangent cone.


This intuition leads to a condition on $X$ known as the restricted nullspace property. Let
us define the subset

$$
\mathbb{C}(S) = \{\Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1\},
$$

corresponding to the cone of vectors whose $\ell_1$-norm off the support is dominated by the
$\ell_1$-norm on the support. The following definition links the nullspace of a matrix $X$ to this set:

Definition 7.7 The matrix $X$ satisfies the restricted nullspace property with respect to
$S$ if $\mathbb{C}(S) \cap \mathrm{null}(X) = \{0\}$.
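In general the restricted nullspace property is not easy to check, but when the nullspace is a single line it reduces to one comparison of $\ell_1$-norms. The sketch below handles only that special case; the function name `rnp_holds_1d_null` and the small matrix are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import null_space


def rnp_holds_1d_null(X, S):
    """Check C(S) ∩ null(X) = {0} when null(X) is one-dimensional."""
    basis = null_space(X)
    if basis.shape[1] != 1:
        raise ValueError("this sketch only handles a one-dimensional nullspace")
    v = basis[:, 0]
    on_support = np.zeros(X.shape[1], dtype=bool)
    on_support[list(S)] = True
    # The nullspace is the line {t v}, and C(S) is invariant under scaling by any t != 0,
    # so the property holds exactly when v itself lies outside C(S).
    return np.abs(v[~on_support]).sum() > np.abs(v[on_support]).sum()


# The nullspace of this matrix is spanned by the all-ones vector (1, 1, 1).
X = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])
holds_for_singleton = rnp_holds_1d_null(X, {0})     # off-support mass 2 > on-support mass 1
holds_for_pair = rnp_holds_1d_null(X, {0, 1})       # off-support mass 1 < on-support mass 2
```

In this example the property holds for the one-element support but fails for the two-element support, so basis pursuit cannot be guaranteed to recover every vector supported on $\{0, 1\}$.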


As shown in the proof of Theorem 7.8 to follow, the difference set $\mathbb{C}(S)$ provides an
alternative way of capturing the behavior of the tangent cone $\mathbb{T}(\theta^*)$, one that is independent
of $\theta^*$. In particular, the proof establishes that, for any $S$-sparse vector $\theta^*$, the tangent cone
$\mathbb{T}(\theta^*)$ is contained within $\mathbb{C}(S)$, and conversely, that $\mathbb{C}(S)$ is contained in the union of such
tangent cones. More precisely, the restricted nullspace property is equivalent to the success
of the basis pursuit LP in the following sense:

Theorem 7.8 The following two properties are equivalent:

(a) For any vector $\theta^* \in \mathbb{R}^d$ with support $S$, the basis pursuit program (7.9) applied
with $y = X\theta^*$ has unique solution $\hat{\theta} = \theta^*$.
(b) The matrix $X$ satisfies the restricted nullspace property with respect to $S$.

Proof We first show that (b) $\Rightarrow$ (a). Since both $\hat{\theta}$ and $\theta^*$ are feasible for the basis pursuit
program, and since $\hat{\theta}$ is optimal, we have $\|\hat{\theta}\|_1 \le \|\theta^*\|_1$. Defining the error vector $\hat{\Delta} := \hat{\theta} - \theta^*$,
we have

$$
\begin{aligned}
\|\theta^*_S\|_1 = \|\theta^*\|_1 &\ge \|\theta^* + \hat{\Delta}\|_1 \\
&= \|\theta^*_S + \hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1 \\
&\ge \|\theta^*_S\|_1 - \|\hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1,
\end{aligned}
$$

where we have used the fact that $\theta^*_{S^c} = 0$, and applied the triangle inequality. Rearranging
this inequality, we conclude that the error $\hat{\Delta} \in \mathbb{C}(S)$. However, by construction, we also have
$X\hat{\Delta} = 0$, so $\hat{\Delta} \in \mathrm{null}(X)$ as well. By our assumption, this implies that $\hat{\Delta} = 0$, or equivalently
that $\hat{\theta} = \theta^*$.

In order to establish the implication (a) $\Rightarrow$ (b), it suffices to show that, if the $\ell_1$-relaxation
succeeds for all $S$-sparse vectors, then the set $\mathrm{null}(X) \setminus \{0\}$ has no intersection with $\mathbb{C}(S)$.
For a given vector $\theta^* \in \mathrm{null}(X) \setminus \{0\}$, consider the basis pursuit problem

$$
\min_{\beta \in \mathbb{R}^d} \|\beta\|_1 \quad \text{such that } X\beta = X \begin{bmatrix} \theta^*_S \\ 0 \end{bmatrix}. \tag{7.11}
$$

By assumption, the unique optimal solution will be $\hat{\beta} = [\theta^*_S \;\; 0]^T$. Since $X\theta^* = 0$ by
assumption, the vector $[0 \;\; -\theta^*_{S^c}]^T$ is also feasible for the problem, and, by uniqueness, we must
have $\|\theta^*_S\|_1 < \|\theta^*_{S^c}\|_1$, implying that $\theta^* \notin \mathbb{C}(S)$ as claimed.


7.2.3 Sufficient conditions for restricted nullspace 


In order for Theorem 7.8 to be a useful result in practice, one requires a certificate that
the restricted nullspace property holds. The earliest sufficient conditions were based on the
incoherence parameter of the design matrix, namely the quantity

$$
\delta_{\mathrm{PW}}(X) := \max_{j, k = 1, \ldots, d} \left| \frac{\langle X_j, X_k \rangle}{n} - \mathbb{I}[j = k] \right|, \tag{7.12}
$$

where $X_j$ denotes the $j$th column of $X$, and $\mathbb{I}[j = k]$ denotes the $\{0, 1\}$-valued indicator for
the event $\{j = k\}$. Here we have chosen to rescale matrix columns by $1/\sqrt{n}$, as it makes
results for random designs more readily interpretable.

The following result shows that a small pairwise incoherence is sufficient to guarantee a
uniform version of the restricted nullspace property.

Proposition 7.9 If the pairwise incoherence satisfies the bound

$$
\delta_{\mathrm{PW}}(X) \le \frac{1}{3s}, \tag{7.13}
$$

then the restricted nullspace property holds for all subsets $S$ of cardinality at most $s$.

We guide the reader through the steps involved in the proof of this claim in Exercise 7.3.
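The pairwise incoherence (7.12) is cheap to compute directly from the Gram matrix, so the bound (7.13) can be checked numerically for a given design. The following sketch assumes a Rademacher design with hypothetical sizes; the helper `max_sparsity_certified` is an illustrative name, not notation from the text.

```python
import numpy as np


def pairwise_incoherence(X):
    """delta_PW(X) = max_{j,k} | <X_j, X_k> / n - 1[j = k] |, as in (7.12)."""
    n, d = X.shape
    return np.max(np.abs(X.T @ X / n - np.eye(d)))


def max_sparsity_certified(X):
    """Largest s <= d with delta_PW(X) <= 1 / (3 s), i.e. condition (7.13); 0 if none."""
    d = X.shape[1]
    delta = pairwise_incoherence(X)
    if delta == 0:
        return d
    return min(d, int(np.floor(1.0 / (3.0 * delta))))


rng = np.random.default_rng(0)
n, d = 400, 50
X = rng.choice([-1.0, 1.0], size=(n, d))      # Rademacher ensemble: columns have norm sqrt(n)
delta_pw = pairwise_incoherence(X)
s_max = max_sparsity_certified(X)
```

Since the certified sparsity is roughly $1/(3\,\delta_{\mathrm{PW}})$, it tends to be small unless $n$ is large relative to $\log d$.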


A related but more sophisticated sufficient condition is the restricted isometry property
(RIP). It can be understood as a natural generalization of the pairwise incoherence
condition, based on looking at conditioning of larger subsets of columns.

Definition 7.10 (Restricted isometry property) For a given integer $s \in \{1, \ldots, d\}$, we
say that $X \in \mathbb{R}^{n \times d}$ satisfies a restricted isometry property of order $s$ with constant
$\delta_s(X) > 0$ if

$$
\left|\!\left|\!\left| \frac{X_S^T X_S}{n} - I_s \right|\!\right|\!\right|_2 \le \delta_s(X) \quad \text{for all subsets } S \text{ of size at most } s. \tag{7.14}
$$

In this definition, we recall that $|\!|\!| \cdot |\!|\!|_2$ denotes the $\ell_2$-operator norm of a matrix,
corresponding to its maximum singular value. For $s = 1$, the RIP condition implies that the rescaled
columns of $X$ are near-unit-norm; that is, we are guaranteed that $\frac{\|X_j\|_2^2}{n} \in [1 - \delta_1, 1 + \delta_1]$ for
all $j = 1, 2, \ldots, d$. For $s = 2$, the RIP constant $\delta_2$ is very closely related to the pairwise
incoherence parameter $\delta_{\mathrm{PW}}(X)$. This connection is most apparent when the matrix $X/\sqrt{n}$ has
unit-norm columns, in which case, for any pair of columns $\{j, k\}$, we have

$$
\frac{X_{\{j,k\}}^T X_{\{j,k\}}}{n} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} \frac{\|X_j\|_2^2}{n} - 1 & \frac{\langle X_j, X_k \rangle}{n} \\ \frac{\langle X_j, X_k \rangle}{n} & \frac{\|X_k\|_2^2}{n} - 1 \end{bmatrix}
\overset{(i)}{=} \begin{bmatrix} 0 & \frac{\langle X_j, X_k \rangle}{n} \\ \frac{\langle X_j, X_k \rangle}{n} & 0 \end{bmatrix},
$$

where the final equality (i) uses the column normalization condition. Consequently, we find

$$
\delta_2(X) = \max_{j \neq k} \left|\!\left|\!\left| \frac{X_{\{j,k\}}^T X_{\{j,k\}}}{n} - I_2 \right|\!\right|\!\right|_2
= \max_{j \neq k} \left| \frac{\langle X_j, X_k \rangle}{n} \right| = \delta_{\mathrm{PW}}(X),
$$

where the final step again uses the column normalization condition. More generally, as we
show in Exercise 7.4, for any matrix $X$ and sparsity level $s \in \{2, \ldots, d\}$, we have the
sandwich relation

$$
\delta_{\mathrm{PW}}(X) \overset{(i)}{\le} \delta_s(X) \overset{(ii)}{\le} s\, \delta_{\mathrm{PW}}(X), \tag{7.15}
$$

and neither bound can be improved in general. (We also show that there exist matrices for
which $\delta_s(X) = \sqrt{s}\, \delta_{\mathrm{PW}}(X)$.) Although RIP imposes constraints on much larger submatrices
than pairwise incoherence, the magnitude of the constraints required to guarantee the
uniform restricted nullspace property can be milder.
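For very small problems, the RIP constant can be computed by brute force, which gives a way to see the sandwich relation (7.15) numerically. The sketch below is an illustration only: the function names and sizes are assumptions, and the enumeration over all $\binom{d}{s}$ subsets is exactly what makes RIP hard to certify in practice.

```python
from itertools import combinations

import numpy as np


def pairwise_incoherence(X):
    """delta_PW(X) as in (7.12)."""
    n, d = X.shape
    return np.max(np.abs(X.T @ X / n - np.eye(d)))


def rip_constant(X, s):
    """Brute-force delta_s(X) from (7.14); cost grows like binom(d, s)."""
    n, d = X.shape
    gram = X.T @ X / n
    worst = 0.0
    # Subsets of size exactly s suffice: a principal submatrix of a symmetric matrix
    # has operator norm no larger than that of any principal submatrix containing it.
    for S in combinations(range(d), s):
        block = gram[np.ix_(S, S)] - np.eye(s)
        worst = max(worst, np.linalg.norm(block, 2))   # ord=2 is the largest singular value
    return worst


rng = np.random.default_rng(0)
n, d, s = 30, 8, 3
X = rng.standard_normal((n, d))
delta_pw = pairwise_incoherence(X)
delta_s = rip_constant(X, s)
```

On any such instance one should find `delta_pw <= delta_s <= s * delta_pw`, in line with (7.15).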

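The following result shows that suitable control on the RIP constants implies that the
restricted nullspace property holds:

Proposition 7.11 If the RIP constant of order $2s$ is bounded as $\delta_{2s}(X) < 1/3$, then the
uniform restricted nullspace property holds for any subset $S$ of cardinality $|S| \le s$.

Proof Let $\theta \in \mathrm{null}(X)$ be an arbitrary non-zero member of the nullspace. For any subset
$A$, we let $\theta_A \in \mathbb{R}^{|A|}$ denote the subvector of elements indexed by $A$, and we define the vector
$\tilde{\theta}_A \in \mathbb{R}^d$ with elements

$$
\tilde{\theta}_j = \begin{cases} \theta_j & \text{if } j \in A, \\ 0 & \text{otherwise.} \end{cases}
$$

We frequently use the fact that $\|\tilde{\theta}_A\| = \|\theta_A\|$ for any elementwise separable norm, such as the
$\ell_1$- or $\ell_2$-norms.

Let $S_0$ be the subset of $\{1, 2, \ldots, d\}$ corresponding to the $s$ entries of $\theta$ that are largest in
absolute value. It suffices to show that $\|\theta_{S_0^c}\|_1 > \|\theta_{S_0}\|_1$ for this subset. Let us write
$S_0^c = \bigcup_{j \ge 1} S_j$, where $S_1$ is the subset of indices given by the $s$ largest values of $\tilde{\theta}_{S_0^c}$; the subset $S_2$
is the largest $s$ in the subset $S_0^c \setminus S_1$, and the final subset may contain fewer than $s$ entries.
Using this notation, we have the decomposition $\theta = \tilde{\theta}_{S_0} + \sum_{k \ge 1} \tilde{\theta}_{S_k}$.

The RIP property guarantees that $\|\tilde{\theta}_{S_0}\|_2^2 \le \frac{1}{1 - \delta_{2s}} \big\| \frac{1}{\sqrt{n}} X \tilde{\theta}_{S_0} \big\|_2^2$. Moreover, since $\theta \in \mathrm{null}(X)$,
we have $X\tilde{\theta}_{S_0} = -\sum_{j \ge 1} X\tilde{\theta}_{S_j}$, and hence

$$
\|\tilde{\theta}_{S_0}\|_2^2 \le \frac{1}{1 - \delta_{2s}} \left| \sum_{j \ge 1} \frac{\langle X\tilde{\theta}_{S_0}, X\tilde{\theta}_{S_j} \rangle}{n} \right|
\overset{(i)}{=} \frac{1}{1 - \delta_{2s}} \left| \sum_{j \ge 1} \tilde{\theta}_{S_0}^T \left[ \frac{X^T X}{n} - I_d \right] \tilde{\theta}_{S_j} \right|,
$$

where equality (i) uses the fact that $\langle \tilde{\theta}_{S_0}, \tilde{\theta}_{S_j} \rangle = 0$.

By the RIP property, for each $j \ge 1$, the $\ell_2 \to \ell_2$ operator norm satisfies the bound
$|\!|\!| n^{-1} X_{S_0 \cup S_j}^T X_{S_0 \cup S_j} - I_{2s} |\!|\!|_2 \le \delta_{2s}$, and hence we have

$$
\|\tilde{\theta}_{S_0}\|_2 \le \frac{\delta_{2s}}{1 - \delta_{2s}} \sum_{j \ge 1} \|\tilde{\theta}_{S_j}\|_2, \tag{7.16}
$$

where we have canceled out a factor of $\|\tilde{\theta}_{S_0}\|_2$ from each side. Finally, by construction of the
sets $S_j$, for each $j \ge 1$, we have $\|\tilde{\theta}_{S_j}\|_\infty \le \frac{1}{s} \|\tilde{\theta}_{S_{j-1}}\|_1$, which implies that $\|\tilde{\theta}_{S_j}\|_2 \le \frac{1}{\sqrt{s}} \|\tilde{\theta}_{S_{j-1}}\|_1$.
Applying these upper bounds to the inequality (7.16), we obtain

$$
\|\tilde{\theta}_{S_0}\|_1 \le \sqrt{s}\, \|\tilde{\theta}_{S_0}\|_2 \le \frac{\delta_{2s}}{1 - \delta_{2s}} \Big\{ \|\tilde{\theta}_{S_0}\|_1 + \sum_{j \ge 1} \|\tilde{\theta}_{S_j}\|_1 \Big\},
$$

or equivalently $\|\tilde{\theta}_{S_0}\|_1 \le \frac{\delta_{2s}}{1 - \delta_{2s}} \big\{ \|\tilde{\theta}_{S_0}\|_1 + \|\tilde{\theta}_{S_0^c}\|_1 \big\}$. Some simple algebra verifies that this
inequality implies that $\|\tilde{\theta}_{S_0}\|_1 < \|\tilde{\theta}_{S_0^c}\|_1$ as long as $\delta_{2s} < 1/3$.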


Like the pairwise incoherence constant, control on the RIP constants is a sufficient
condition for the basis pursuit LP to succeed. A major advantage of the RIP approach is that
for various classes of random design matrices, of particular interest in compressed sensing
(see Example 7.5), it can be used to guarantee exactness of basis pursuit using a sample
size $n$ that is much smaller than that guaranteed by pairwise incoherence. As we explore in
Exercise 7.7, for sub-Gaussian random matrices with i.i.d. elements, the pairwise incoherence
is bounded by $\frac{1}{3s}$ with high probability as long as $n \gtrsim s^2 \log d$. By contrast, this same
exercise also shows that the RIP constants for certain classes of random design matrices $X$
are well controlled as long as $n \gtrsim s \log(ed/s)$. Consequently, the RIP approach overcomes
the “quadratic barrier”, namely the requirement that the sample size $n$ scales quadratically
in the sparsity $s$, as in the pairwise incoherence approach.

It should be noted that, unlike the restricted nullspace property, neither the pairwise
incoherence condition nor the RIP condition is necessary. Indeed, the basis pursuit
LP succeeds for many classes of matrices for which both pairwise incoherence and RIP
conditions are violated. For example, consider a random matrix $X \in \mathbb{R}^{n \times d}$ with i.i.d. rows
$X_i \sim N(0, \Sigma)$. Letting $\mathbf{1} \in \mathbb{R}^d$ denote the all-ones vector, consider the family of covariance
matrices

$$
\Sigma := (1 - \mu) I_d + \mu\, \mathbf{1}\mathbf{1}^T, \tag{7.17}
$$

for a parameter $\mu \in [0, 1)$. In Exercise 7.8, we show that, for any fixed $\mu \in (0, 1)$, the pairwise
incoherence bound (7.13) is violated with high probability for large $s$, and moreover that the
condition number of any $2s$-sized subset grows at the rate $\mu\sqrt{s}$ with high probability, so that
the RIP constants will (with high probability) grow unboundedly as $s \to +\infty$ for any fixed
$\mu \in (0, 1)$. Nonetheless, for any $\mu \in [0, 1)$, the basis pursuit LP relaxation still succeeds with
high probability with sample size $n \gtrsim s \log(ed/s)$, as illustrated in Figure 7.4. Later in the
chapter, we provide a result on random matrices that allows for direct verification of the
restricted nullspace property for various families, including (among others) the family (7.17).
See Theorem 7.16 and the associated discussion for further details.

Figure 7.3 (a) Probability of basis pursuit success versus the raw sample size $n$ for
random design matrices drawn with i.i.d. $N(0, 1)$ entries. Each curve corresponds to
a different problem size $d \in \{128, 256, 512\}$ with sparsity $s = \lceil 0.1 d \rceil$. (b) The same
results replotted versus the rescaled sample size $n/(s \log(ed/s))$. The curves exhibit
a phase transition at the same value of this rescaled sample size.


7.3 Estimation in noisy settings 


Let us now turn to the noisy setting, in which we observe the vector-matrix pair
$(y, X) \in \mathbb{R}^n \times \mathbb{R}^{n \times d}$ linked by the observation model $y = X\theta^* + w$. The new ingredient here is the noise
vector $w \in \mathbb{R}^n$. A natural extension of the basis pursuit program is based on minimizing a
weighted combination of the data-fidelity term $\|y - X\theta\|_2^2$ with the $\ell_1$-norm penalty, say of
the form

$$
\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \left\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \right\}. \tag{7.18}
$$

Here $\lambda_n > 0$ is a regularization parameter to be chosen by the user. Following
Tibshirani (1996), we refer to it as the Lasso program.

Alternatively, one can consider different constrained forms of the Lasso, that is either

$$
\min_{\theta \in \mathbb{R}^d} \left\{ \frac{1}{2n} \|y - X\theta\|_2^2 \right\} \quad \text{such that } \|\theta\|_1 \le R \tag{7.19}
$$

for some radius $R > 0$, or

$$
\min_{\theta \in \mathbb{R}^d} \|\theta\|_1 \quad \text{such that } \frac{1}{2n} \|y - X\theta\|_2^2 \le b^2 \tag{7.20}
$$

for some noise tolerance $b > 0$. The constrained version (7.20) is referred to as relaxed basis
pursuit by Chen et al. (1998). By Lagrangian duality theory, all three families of convex
programs are equivalent. More precisely, for any choice of radius $R > 0$ in the constrained
variant (7.19), there is a regularization parameter $\lambda \ge 0$ such that solving the Lagrangian
version (7.18) is equivalent to solving the constrained version (7.19). Similar statements
apply to choices of $b > 0$ in the constrained variant (7.20).

Figure 7.4 (a) Probability of basis pursuit success versus the raw sample size $n$
for random design matrices drawn with i.i.d. rows $X_i \sim N(0, \Sigma)$, where $\mu = 0.5$
in the model (7.17). Each curve corresponds to a different problem size
$d \in \{128, 256, 512\}$ with sparsity $s = \lceil 0.1 d \rceil$. (b) The same results replotted versus the
rescaled sample size $n/(s \log(ed/s))$. The curves exhibit a phase transition at the
same value of this rescaled sample size.
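To make the Lagrangian form (7.18) concrete, here is a minimal sketch of one standard way to solve it numerically, proximal gradient descent (often called ISTA). The text does not prescribe a solver, so the algorithm, the function names, the problem sizes, and the particular value of the regularization parameter are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_objective(X, y, theta, lam):
    """The Lagrangian Lasso cost (1 / 2n) ||y - X theta||_2^2 + lam ||theta||_1."""
    n = X.shape[0]
    return np.sum((y - X @ theta) ** 2) / (2 * n) + lam * np.abs(theta).sum()


def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal gradient iterations for the Lagrangian Lasso, started from zero."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1 / L, with L the top eigenvalue of X^T X / n
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n      # gradient of the quadratic term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta


# Hypothetical noisy instance; lam below is a heuristic choice, not a tuned value.
rng = np.random.default_rng(0)
n, d, s, sigma = 100, 200, 5, 0.1
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)
theta_hat = lasso_ista(X, y, lam)
```

A fixed iteration count is used for simplicity; a more careful implementation would monitor the change in the objective and stop once it falls below a tolerance.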


7.3.1 Restricted eigenvalue condition 


In the noisy setting, we can no longer expect to achieve perfect recovery. Instead, we focus
on bounding the $\ell_2$-error $\|\hat{\theta} - \theta^*\|_2$ between a Lasso solution $\hat{\theta}$ and the unknown regression
vector $\theta^*$. In the presence of noise, we require a condition that is closely related to but slightly
stronger than the restricted nullspace property: namely, that the restricted eigenvalues of the
matrix $\frac{X^T X}{n}$ are lower bounded over a cone. In particular, for a constant $\alpha \ge 1$, let us define
the set

$$
\mathbb{C}_\alpha(S) := \{\Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1\}. \tag{7.21}
$$

This definition generalizes the set $\mathbb{C}(S)$ used in our definition of the restricted nullspace
property, which corresponds to the special case $\alpha = 1$.

Definition 7.12 The matrix $X$ satisfies the restricted eigenvalue (RE) condition over
$S$ with parameters $(\kappa, \alpha)$ if

$$
\frac{1}{n} \|X\Delta\|_2^2 \ge \kappa \|\Delta\|_2^2 \quad \text{for all } \Delta \in \mathbb{C}_\alpha(S). \tag{7.22}
$$

Note that the RE condition is a strengthening of the restricted nullspace property. In
particular, if the RE condition holds with parameters $(\kappa, 1)$ for any $\kappa > 0$, then the restricted
nullspace property holds. Moreover, we will prove that under the RE condition, the error
$\|\hat{\theta} - \theta^*\|_2$ in the Lasso solution is well controlled.

From where does the need for the RE condition arise? To provide some intuition, let us
consider the constrained version (7.19) of the Lasso, with radius $R = \|\theta^*\|_1$. With this setting,
the true parameter vector $\theta^*$ is feasible for the problem. By definition, the Lasso estimate $\hat{\theta}$
minimizes the quadratic cost function $\mathcal{L}_n(\theta) = \frac{1}{2n} \|y - X\theta\|_2^2$ over the $\ell_1$-ball of radius $R$. As
the amount of data increases, we expect that $\theta^*$ should become a near-minimizer of the same
cost function, so that $\mathcal{L}_n(\hat{\theta}) \approx \mathcal{L}_n(\theta^*)$. But when does closeness in cost imply that the error
vector $\hat{\Delta} := \hat{\theta} - \theta^*$ is small? As illustrated in Figure 7.5, the link between the cost difference
$\delta\mathcal{L}_n := \mathcal{L}_n(\theta^*) - \mathcal{L}_n(\hat{\theta})$ and the error $\hat{\Delta} = \hat{\theta} - \theta^*$ is controlled by the curvature of the cost
function. In the favorable setting of Figure 7.5(a), the cost has a high curvature around its
optimum $\hat{\theta}$, so that a small excess loss $\delta\mathcal{L}_n$ implies that the error vector $\hat{\Delta}$ is small. This
curvature no longer holds for the cost function in Figure 7.5(b), for which it is possible that
$\delta\mathcal{L}_n$ could be small while the error $\hat{\Delta}$ is relatively large.


Figure 7.5 Illustration of the connection between curvature (strong convexity) of
the cost function, and estimation error. (a) In a favorable setting, the cost function
is sharply curved around its minimizer $\hat{\theta}$, so that a small change
$\delta\mathcal{L}_n := \mathcal{L}_n(\theta^*) - \mathcal{L}_n(\hat{\theta})$ in the cost implies that the error vector $\hat{\Delta} = \hat{\theta} - \theta^*$ is not too large.
(b) In an unfavorable setting, the cost is very flat, so that a small cost difference $\delta\mathcal{L}_n$
need not imply small error.

Figure 7.5 illustrates a one-dimensional function, in which case the curvature can be
captured by a scalar. For a function in $d$ dimensions, the curvature of a cost function is captured
by the structure of its Hessian matrix $\nabla^2 \mathcal{L}_n(\theta)$, which is a symmetric positive semidefinite
matrix. In the special case of the quadratic cost function that underlies the Lasso, the Hessian
is easily calculated as

$$
\nabla^2 \mathcal{L}_n(\theta) = \frac{1}{n} X^T X. \tag{7.23}
$$

If we could guarantee that the eigenvalues of this matrix were uniformly bounded away from
zero, say that

$$
\frac{\|X\Delta\|_2^2}{n} \ge \kappa \|\Delta\|_2^2 > 0 \quad \text{for all } \Delta \in \mathbb{R}^d \setminus \{0\}, \tag{7.24}
$$

then we would be assured of having curvature in all directions.

Figure 7.6 (a) A convex cost function in high-dimensional settings (with $d \gg n$)
cannot be strongly convex; rather, it will be curved in some directions but flat in
others. (b) The Lasso error $\hat{\Delta}$ must lie in the restricted subset $\mathbb{C}_\alpha(S)$ of $\mathbb{R}^d$. For this
reason, it is only necessary that the cost function be curved in certain directions of
space.

In the high-dimensional setting with $d > n$, this Hessian is a $d \times d$ matrix with rank at most
$n$, so that it is impossible to guarantee that it has a positive curvature in all directions. Rather,
the quadratic cost function always has the form illustrated in Figure 7.6(a): although it may
be curved in some directions, there is always a $(d - n)$-dimensional subspace of directions
in which it is completely flat! Consequently, the uniform lower bound (7.24) is never
satisfied. For this reason, we need to relax the stringency of the uniform curvature condition, and
require that it holds only for a subset $\mathbb{C}_\alpha(S)$ of vectors, as illustrated in Figure 7.6(b). If we
can be assured that the subset $\mathbb{C}_\alpha(S)$ is well aligned with the curved directions of the
Hessian, then a small difference in the cost function will translate into bounds on the difference
between $\hat{\theta}$ and $\theta^*$.
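The rank deficiency of the Hessian when $d > n$ is easy to observe numerically. The following sketch, with hypothetical sizes and a Gaussian design chosen only for illustration, counts the flat directions of $X^T X / n$ and evaluates the curvature along one of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.standard_normal((n, d))

hessian = X.T @ X / n                      # Hessian of the quadratic cost, as in (7.23)
eigenvalues = np.linalg.eigvalsh(hessian)  # ascending order; symmetric input
num_flat = int(np.sum(eigenvalues < 1e-10))

# The last right singular vectors of X span null(X); pick one as a flat direction.
_, _, Vt = np.linalg.svd(X)                # full_matrices=True by default, so Vt is d x d
flat_direction = Vt[-1]
curvature_along_flat = flat_direction @ hessian @ flat_direction
```

For a generic design with $d > n$ one expects `num_flat` to equal $d - n$, matching the $(d - n)$-dimensional flat subspace described above.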


7.3.2 Bounds on $\ell_2$-error for hard sparse models

With this intuition in place, we now state a result that provides a bound on the error $\|\hat{\theta} - \theta^*\|_2$
in the case of a “hard sparse” vector $\theta^*$. In particular, let us impose the following conditions:

(A1) The vector $\theta^*$ is supported on a subset $S \subseteq \{1, 2, \ldots, d\}$ with $|S| = s$.
(A2) The design matrix satisfies the restricted eigenvalue condition (7.22) over $S$ with
parameters $(\kappa, 3)$.

The following result provides bounds on the $\ell_2$-error between any Lasso solution $\hat{\theta}$ and the
true vector $\theta^*$.

Theorem 7.13 Under assumptions (A1) and (A2):

(a) Any solution of the Lagrangian Lasso (7.18) with regularization parameter lower
bounded as $\lambda_n \ge 2 \big\| \frac{X^T w}{n} \big\|_\infty$ satisfies the bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{3}{\kappa} \sqrt{s}\, \lambda_n. \tag{7.25a}
$$

(b) Any solution of the constrained Lasso (7.19) with $R = \|\theta^*\|_1$ satisfies the bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{4}{\kappa} \sqrt{s}\, \Big\| \frac{X^T w}{n} \Big\|_\infty. \tag{7.25b}
$$

(c) Any solution of the relaxed basis pursuit program (7.20) with $b^2 \ge \frac{\|w\|_2^2}{2n}$ satisfies the
bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{4}{\kappa} \sqrt{s}\, \Big\| \frac{X^T w}{n} \Big\|_\infty + \frac{2}{\sqrt{\kappa}} \sqrt{b^2 - \frac{\|w\|_2^2}{2n}}. \tag{7.25c}
$$

In addition, all three solutions satisfy the $\ell_1$-bound $\|\hat{\theta} - \theta^*\|_1 \le 4\sqrt{s}\, \|\hat{\theta} - \theta^*\|_2$.

In order to develop intuition for these claims, we first discuss them at a high level, and
then illustrate them with some concrete examples. First, it is important to note that these
results are deterministic, and apply to any set of linear regression equations. As stated,
however, the results involve unknown quantities stated in terms of $w$ and/or $\theta^*$. Obtaining results
for specific statistical models (as determined by assumptions on the noise vector $w$ and/or
the design matrix) involves bounding or approximating these quantities. Based on our
earlier discussion of the role of strong convexity, it is natural that all three upper bounds are
inversely proportional to the restricted eigenvalue constant $\kappa > 0$. Their scaling with $\sqrt{s}$ is
also natural, since we are trying to estimate the unknown regression vector with $s$ unknown
entries. The remaining terms in the bound involve the unknown noise vector, either via the
quantity $\big\| \frac{X^T w}{n} \big\|_\infty$ in parts (a), (b) and (c), or additionally via $\frac{\|w\|_2^2}{n}$ in part (c).

Let us illustrate some concrete consequences of Theorem 7.13 for some linear regression
models that are commonly used and studied.

Example 7.14 (Classical linear Gaussian model) We begin with the classical linear
Gaussian model from statistics, for which the noise vector $w \in \mathbb{R}^n$ has i.i.d. $N(0, \sigma^2)$ entries.
Let us consider the case of deterministic design, meaning that the matrix $X \in \mathbb{R}^{n \times d}$ is fixed.
Suppose that $X$ satisfies the RE condition (7.22) and that it is $C$-column normalized, meaning
that $\max_{j = 1, \ldots, d} \frac{\|X_j\|_2}{\sqrt{n}} \le C$, where $X_j \in \mathbb{R}^n$ denotes the $j$th column of $X$. With this set-up,
the random variable $\big\| \frac{X^T w}{n} \big\|_\infty$ corresponds to the absolute maximum of $d$ zero-mean
Gaussian variables, each with variance at most $\frac{C^2 \sigma^2}{n}$. Consequently, from standard Gaussian tail
bounds (Exercise 2.12), we have

$$
\mathbb{P}\left[ \Big\| \frac{X^T w}{n} \Big\|_\infty \ge C\sigma \Big( \sqrt{\frac{2 \log d}{n}} + \delta \Big) \right] \le 2 e^{-\frac{n \delta^2}{2}} \quad \text{for all } \delta > 0.
$$

Consequently, if we set $\lambda_n = 2C\sigma \big( \sqrt{\frac{2 \log d}{n}} + \delta \big)$, then Theorem 7.13(a) implies that any
optimal solution of the Lagrangian Lasso (7.18) satisfies the bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{6 C \sigma}{\kappa} \sqrt{s} \left\{ \sqrt{\frac{2 \log d}{n}} + \delta \right\} \tag{7.26}
$$

with probability at least $1 - 2e^{-\frac{n \delta^2}{2}}$. Similarly, Theorem 7.13(b) implies that any optimal
solution of the constrained Lasso (7.19) satisfies the bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{4 C \sigma}{\kappa} \sqrt{s} \left\{ \sqrt{\frac{2 \log d}{n}} + \delta \right\} \tag{7.27}
$$

with the same probability. Apart from constant factors, these two bounds are equivalent.
Perhaps the most significant difference is that the constrained Lasso (7.19) assumes exact
knowledge of the $\ell_1$-norm $\|\theta^*\|_1$, whereas the Lagrangian Lasso only requires knowledge
of the noise variance $\sigma^2$. In practice, it is relatively straightforward to estimate the noise
variance, whereas the $\ell_1$-norm is a more delicate object.

Turning to Theorem 7.13(c), given the Gaussian noise vector $w$, the rescaled variable $\frac{\|w\|_2^2}{\sigma^2}$
is $\chi^2$ with $n$ degrees of freedom. From Example 2.11, we have

$$
\mathbb{P}\left[ \Big| \frac{\|w\|_2^2}{n} - \sigma^2 \Big| \ge \sigma^2 \delta \right] \le 2 e^{-n\delta^2/8} \quad \text{for all } \delta \in (0, 1).
$$

Consequently, Theorem 7.13(c) implies that any optimal solution of the relaxed basis pursuit
program (7.20) with $b^2 = \frac{\sigma^2}{2}(1 + \delta)$ satisfies the bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{8 C \sigma}{\kappa} \sqrt{s} \left\{ \sqrt{\frac{2 \log d}{n}} + \delta \right\} + \frac{2\sigma}{\sqrt{\kappa}} \sqrt{\delta} \quad \text{for all } \delta \in (0, 1),
$$

with probability at least $1 - 4e^{-\frac{n \delta^2}{8}}$. ♣


Example 7.15 (Compressed sensing) In the domain of compressed sensing, the design
matrix $X$ can be chosen by the user, and one standard choice is the standard Gaussian matrix
with i.i.d. $N(0, 1)$ entries. Suppose that the noise vector $w \in \mathbb{R}^n$ is deterministic, say with
bounded entries ($\|w\|_\infty \le \sigma$). Under these assumptions, each variable $X_j^T w / \sqrt{n}$ is a
zero-mean Gaussian with variance at most $\sigma^2$. Thus, by following the same argument as in the
preceding example, we conclude that the Lasso estimates will again satisfy the bounds (7.26)
and (7.27), this time with $C = 1$. Similarly, if we set $b^2 = \frac{\sigma^2}{2}$, then the relaxed basis pursuit
program (7.20) will satisfy the bound

$$
\|\hat{\theta} - \theta^*\|_2 \le \frac{8\sigma}{\kappa} \sqrt{s} \left\{ \sqrt{\frac{2 \log d}{n}} + \delta \right\} + \frac{2\sigma}{\sqrt{\kappa}}
$$

with probability at least $1 - 2e^{-\frac{n \delta^2}{2}}$. ♣
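The tail bound on $\| X^T w / n \|_\infty$ that drives the choice of $\lambda_n$ in Example 7.14 can be sanity-checked by simulation. The sketch below uses a fixed design with columns rescaled so that $C = 1$; the sizes, the value of $\delta$, and the number of trials are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, delta = 200, 500, 1.0, 0.2

# Fixed design, rescaled so that every column satisfies ||X_j||_2 / sqrt(n) = 1 (C = 1).
X = rng.standard_normal((n, d))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
C = 1.0

threshold = C * sigma * (np.sqrt(2 * np.log(d) / n) + delta)
lam = 2 * threshold                                   # regularization level from Example 7.14

# Monte Carlo estimate of P[ ||X^T w / n||_inf >= threshold ] over fresh noise draws.
trials = 2000
W = sigma * rng.standard_normal((n, trials))          # one noise vector per column
stats = np.max(np.abs(X.T @ W), axis=0) / n
empirical_freq = np.mean(stats >= threshold)
tail_bound = 2 * np.exp(-n * delta**2 / 2)
```

The empirical frequency should not exceed `tail_bound`; in practice it is usually far smaller, which reflects the slack in the union bound behind the inequality.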
With these examples in hand, we now turn to the proof of Theorem 7.13. 


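Proof (b) We begin by proving the error bound (7.25b) for the constrained Lasso (7.19).
Given the choice $R = \|\theta^*\|_1$, the target vector $\theta^*$ is feasible. Since $\hat{\theta}$ is optimal, we have the
inequality $\frac{1}{2n} \|y - X\hat{\theta}\|_2^2 \le \frac{1}{2n} \|y - X\theta^*\|_2^2$. Defining the error vector $\hat{\Delta} := \hat{\theta} - \theta^*$ and performing
some algebra yields the basic inequality

$$
\frac{\|X\hat{\Delta}\|_2^2}{n} \le \frac{2 w^T X \hat{\Delta}}{n}. \tag{7.28}
$$

Applying Hölder’s inequality to the right-hand side yields $\frac{\|X\hat{\Delta}\|_2^2}{n} \le 2 \big\| \frac{X^T w}{n} \big\|_\infty \|\hat{\Delta}\|_1$. As shown
in the proof of Theorem 7.8, whenever $\|\hat{\theta}\|_1 \le \|\theta^*\|_1$ for an $S$-sparse vector, the error $\hat{\Delta}$
belongs to the cone $\mathbb{C}_1(S)$, whence

$$
\|\hat{\Delta}\|_1 = \|\hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1 \le 2 \|\hat{\Delta}_S\|_1 \le 2\sqrt{s}\, \|\hat{\Delta}\|_2.
$$

Since $\mathbb{C}_1(S)$ is a subset of $\mathbb{C}_3(S)$, we may apply the restricted eigenvalue condition (7.22) to
the left-hand side of the inequality (7.28), thereby obtaining $\frac{\|X\hat{\Delta}\|_2^2}{n} \ge \kappa \|\hat{\Delta}\|_2^2$. Putting together
the pieces yields the claimed bound.

(c) Next we prove the error bound (7.25c) for the relaxed basis pursuit (RBP) program.
Note that $\frac{1}{2n} \|y - X\theta^*\|_2^2 = \frac{\|w\|_2^2}{2n} \le b^2$, where the inequality follows by our assumed choice
of $b$. Thus, the target vector $\theta^*$ is feasible, and since $\hat{\theta}$ is optimal, we have $\|\hat{\theta}\|_1 \le \|\theta^*\|_1$. As
previously reasoned, the error vector $\hat{\Delta} = \hat{\theta} - \theta^*$ must then belong to the cone $\mathbb{C}_1(S)$. Now
by the feasibility of $\hat{\theta}$, we have

$$
\frac{1}{2n} \|y - X\hat{\theta}\|_2^2 \le b^2 = \frac{1}{2n} \|y - X\theta^*\|_2^2 + \Big( b^2 - \frac{\|w\|_2^2}{2n} \Big).
$$

Rearranging yields the modified basic inequality

$$
\frac{\|X\hat{\Delta}\|_2^2}{n} \le \frac{2 w^T X \hat{\Delta}}{n} + 2 \Big( b^2 - \frac{\|w\|_2^2}{2n} \Big).
$$

Applying the same argument as in part (b), namely the RE condition to the left-hand side
and the cone inequality to the right-hand side, we obtain

$$
\kappa \|\hat{\Delta}\|_2^2 \le 4\sqrt{s}\, \|\hat{\Delta}\|_2 \Big\| \frac{X^T w}{n} \Big\|_\infty + 2 \Big( b^2 - \frac{\|w\|_2^2}{2n} \Big).
$$

Since an inequality of the form $\kappa x^2 \le a x + c$ with $a, c \ge 0$ implies $x \le \frac{a}{\kappa} + \sqrt{\frac{c}{\kappa}}$, we conclude
that $\|\hat{\Delta}\|_2 \le \frac{4}{\kappa} \sqrt{s}\, \big\| \frac{X^T w}{n} \big\|_\infty + \frac{2}{\sqrt{\kappa}} \sqrt{b^2 - \frac{\|w\|_2^2}{2n}}$, as claimed.

(a) Finally, we prove the bound (7.25a) for the Lagrangian Lasso (7.18). Our first step is
to show that, under the condition $\lambda_n \ge 2 \big\| \frac{X^T w}{n} \big\|_\infty$, the error vector $\hat{\Delta}$ belongs to $\mathbb{C}_3(S)$. To
establish this intermediate claim, let us define the Lagrangian $L(\theta; \lambda_n) = \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1$.
Since $\hat{\theta}$ is optimal, we have

$$
L(\hat{\theta}; \lambda_n) \le L(\theta^*; \lambda_n) = \frac{1}{2n} \|w\|_2^2 + \lambda_n \|\theta^*\|_1.
$$

Rearranging yields the Lagrangian basic inequality

$$
0 \le \frac{1}{2n} \|X\hat{\Delta}\|_2^2 \le \frac{w^T X \hat{\Delta}}{n} + \lambda_n \big\{ \|\theta^*\|_1 - \|\hat{\theta}\|_1 \big\}. \tag{7.29}
$$

Now since $\theta^*$ is $S$-sparse, we can write

$$
\|\theta^*\|_1 - \|\hat{\theta}\|_1 = \|\theta^*_S\|_1 - \|\theta^*_S + \hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1.
$$

Substituting into the basic inequality (7.29) yields

$$
\begin{aligned}
0 \le \frac{1}{n} \|X\hat{\Delta}\|_2^2 &\le \frac{2 w^T X \hat{\Delta}}{n} + 2\lambda_n \big\{ \|\theta^*_S\|_1 - \|\theta^*_S + \hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1 \big\} \\
&\overset{(i)}{\le} 2 \big\| X^T w / n \big\|_\infty \|\hat{\Delta}\|_1 + 2\lambda_n \big\{ \|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1 \big\} \\
&\overset{(ii)}{\le} \lambda_n \big\{ 3 \|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1 \big\},
\end{aligned} \tag{7.30}
$$

where step (i) follows from a combination of Hölder’s inequality and the triangle inequality,
whereas step (ii) follows from the choice of $\lambda_n$. Inequality (7.30) shows that $\hat{\Delta} \in \mathbb{C}_3(S)$,
so that the RE condition may be applied. Doing so, we obtain $\kappa \|\hat{\Delta}\|_2^2 \le 3\lambda_n \sqrt{s}\, \|\hat{\Delta}\|_2$, which
implies the claim (7.25a).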


7.3.3 Restricted nullspace and eigenvalues for random designs 


Theorem 7.13 is based on assuming that the design matrix X satisfies the restricted eigen- 
value (RE) condition (7.22). In practice, it is difficult to verify that a given design matrix X 
satisfies this condition. Indeed, developing methods to “certify” design matrices in this way 
is one line of on-going research. However, it is possible to give high-probability results in 
the case of random design matrices. As discussed previously, pairwise incoherence and RIP 
conditions are one way in which to certify the restricted nullspace and eigenvalue properties, 
and are well suited to isotropic designs (in which the population covariance matrix of the 
rows $x_i$ is the identity). Many other random design matrices encountered in practice do not
have such an isotropic structure, so that it is desirable to have alternative direct verifications 
of the restricted nullspace property. 


The following theorem provides a result along these lines. It involves the maximum diagonal
entry $\rho^2(\Sigma)$ of a covariance matrix $\Sigma$.

Theorem 7.16 Consider a random matrix $X \in \mathbb{R}^{n \times d}$, in which each row $x_i \in \mathbb{R}^d$ is
drawn i.i.d. from a $N(0, \Sigma)$ distribution. Then there are universal positive constants
$c_1 < 1 < c_2$ such that

\[
\frac{\|X\theta\|_2^2}{n} \ge c_1\|\sqrt{\Sigma}\,\theta\|_2^2 - c_2\,\rho^2(\Sigma)\,\frac{\log d}{n}\,\|\theta\|_1^2 \qquad \text{for all } \theta \in \mathbb{R}^d \tag{7.31}
\]

with probability at least $1 - \frac{e^{-n/32}}{1 - e^{-n/32}}$.


Remark: The proof of this result is provided in the Appendix (Section 7.6). It makes use
of techniques discussed in other chapters, including the Gordon–Slepian inequalities (Chapters
5 and 6) and concentration of measure for Gaussian functions (Chapter 2). Concretely,
we show that the bound (7.31) holds with $c_1 = \frac{1}{8}$ and $c_2 = 50$, but sharper constants can be
obtained with a more careful argument. It can be shown (Exercise 7.11) that a lower bound
of the form (7.31) implies that an RE condition (and hence a restricted nullspace condition)
holds over $\mathbb{C}_3(S)$, uniformly over all subsets of cardinality $|S| \le \frac{c_1}{32 c_2}\frac{\gamma_{\min}(\Sigma)}{\rho^2(\Sigma)}\frac{n}{\log d}$.

Theorem 7.16 can be used to establish restricted nullspace and eigenvalue conditions for
various matrix ensembles that do not satisfy incoherence or RIP conditions. Let us consider
a few examples to illustrate.


Example 7.17 (Geometric decay) Consider a covariance matrix with the Toeplitz structure
$\Sigma_{ij} = \nu^{|i-j|}$ for some parameter $\nu \in [0, 1)$. This type of geometrically decaying covariance
structure arises naturally from autoregressive processes, where the parameter $\nu$ allows for
tuning of the memory in the process. By classical results on eigenvalues of Toeplitz matrices,
we have $\gamma_{\min}(\Sigma) \ge (1 - \nu)^2 > 0$ and $\rho^2(\Sigma) = 1$, independently of the dimension
$d$. Consequently, Theorem 7.16 implies that, with high probability, the sample covariance
matrix $\hat\Sigma = \frac{X^T X}{n}$ obtained by sampling from this distribution will satisfy the RE condition
for all subsets $S$ of cardinality at most $|S| \le \frac{c_1}{32 c_2}(1 - \nu)^2\frac{n}{\log d}$. This provides an example of
a matrix family with substantial correlation between covariates for which the RE property
still holds. ♣
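The eigenvalue claim in this example can be sanity-checked numerically. The sketch below builds the Toeplitz matrix with entries $\nu^{|i-j|}$ and estimates its smallest eigenvalue by power iteration on a shifted matrix; the function names are hypothetical, and power iteration is merely one convenient stdlib-only choice, not the method behind the classical results cited above. The returned Rayleigh quotient can only overestimate the true minimum eigenvalue, so it is an approximation rather than a certificate.

```python
def toeplitz_cov(d, nu):
    """Covariance with entries nu^{|i-j|}."""
    return [[nu ** abs(i - j) for j in range(d)] for i in range(d)]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def min_eig_estimate(Sigma, iters=300):
    """Power iteration on c*I - Sigma; returns the Rayleigh quotient v^T Sigma v."""
    d = len(Sigma)
    # Gershgorin bound plus one, so that c*I - Sigma is positive definite.
    c = 1.0 + max(sum(abs(x) for x in row) for row in Sigma)
    v = [(-1.0) ** i / d ** 0.5 for i in range(d)]  # oscillating unit start vector
    for _ in range(iters):
        Sv = matvec(Sigma, v)
        u = [c * vi - si for vi, si in zip(v, Sv)]
        norm = sum(x * x for x in u) ** 0.5
        v = [x / norm for x in u]
    return sum(vi * si for vi, si in zip(v, matvec(Sigma, v)))


nu, d = 0.5, 20
Sigma = toeplitz_cov(d, nu)
est = min_eig_estimate(Sigma)
print(est, (1 - nu) ** 2)                        # estimate versus the lower bound (1 - nu)^2
print(max(Sigma[i][i] for i in range(d)))        # rho^2(Sigma), the largest diagonal entry
```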


We now consider a matrix family with an even higher amount of dependence among the 
covariates. 


Example 7.18 (Spiked identity model) Recall from our earlier discussion the spiked identity
family (7.17) of covariance matrices. This family of covariance matrices is parameterized
by a scalar $\mu \in [0, 1)$, and we have $\gamma_{\min}(\Sigma) = 1 - \mu$ and $\rho^2(\Sigma) = 1$, again independent
of the dimension. Consequently, Theorem 7.16 implies that, with high probability, the
sample covariance based on i.i.d. draws from this ensemble satisfies the restricted eigenvalue
and restricted nullspace conditions uniformly over all subsets of cardinality at most
$|S| \le \frac{c_1}{32 c_2}(1 - \mu)\frac{n}{\log d}$.

However, for any $\mu \ne 0$, the spiked identity matrix is very poorly conditioned, and also
has poorly conditioned submatrices. This fact implies that both the pairwise incoherence
and restricted isometry property will be violated with high probability, regardless of how
large the sample size is taken. To see this, for an arbitrary subset $S$ of size $s$, consider the
associated $s \times s$ submatrix of $\Sigma$, which we denote by $\Sigma_{SS}$. The maximal eigenvalue of $\Sigma_{SS}$
scales as $1 + \mu(s - 1)$, which diverges as $s$ increases for any fixed $\mu > 0$. As we explore in
Exercise 7.8, this fact implies that both pairwise incoherence and RIP will be violated with
high probability. ♣
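The eigenvalues quoted in this example are easy to confirm directly. The sketch below assumes that the spiked identity family has the form $(1 - \mu) I + \mu \mathbf{1}\mathbf{1}^T$ (unit diagonal, off-diagonal entries $\mu$), which is consistent with the values stated above but is not reproduced in this section; the function names are hypothetical. It checks that the all-ones vector is an eigenvector with eigenvalue $1 + \mu(s-1)$ and that a contrast vector has eigenvalue $1 - \mu$.

```python
def spiked_block(s, mu):
    """s x s principal submatrix: ones on the diagonal, mu elsewhere (assumed form)."""
    return [[1.0 if i == j else mu for j in range(s)] for i in range(s)]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def extreme_eigs(s, mu):
    """Eigenvalues read off from two explicit eigenvectors (requires s >= 2)."""
    A = spiked_block(s, mu)
    top = matvec(A, [1.0] * s)[0]                            # eigenvector (1, ..., 1)
    bottom = matvec(A, [1.0, -1.0] + [0.0] * (s - 2))[0]     # eigenvector (1, -1, 0, ...)
    return top, bottom


for s in (2, 10, 50):
    top, bottom = extreme_eigs(s, mu=0.3)
    print(s, top, bottom, top / bottom)  # the condition number grows linearly in s
```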


When a bound of the form (7.31) holds, it is also possible to prove a more general result 
on the Lasso error, known as an oracle inequality. This result holds without any assumptions 
whatsoever on the underlying regression vector $\theta^* \in \mathbb{R}^d$, and it actually yields a family of
upper bounds with a tunable parameter to be optimized. The flexibility in tuning this parameter
is akin to that of an oracle, which would have access to the ordered coefficients of
$\theta^*$. In order to minimize notational clutter, we introduce the convenient shorthand notation
$\bar\kappa := \gamma_{\min}(\Sigma)$.


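Theorem 7.19 (Lasso oracle inequality) Under the condition (7.31), consider the
Lagrangian Lasso (7.18) with regularization parameter $\lambda_n \ge 2\|X^T w/n\|_\infty$. For any
$\theta^* \in \mathbb{R}^d$, any optimal solution $\hat\theta$ satisfies the bound

\[
\|\hat\theta - \theta^*\|_2^2 \le \underbrace{\frac{144}{c_1^2}\frac{\lambda_n^2}{\bar\kappa^2}|S|}_{\text{estimation error}} + \underbrace{\frac{16}{c_1}\frac{\lambda_n}{\bar\kappa}\|\theta^*_{S^c}\|_1 + \frac{32 c_2}{c_1}\frac{\rho^2(\Sigma)}{\bar\kappa}\frac{\log d}{n}\|\theta^*_{S^c}\|_1^2}_{\text{approximation error}}, \tag{7.32}
\]

valid for any subset $S$ with cardinality $|S| \le \frac{c_1}{64 c_2}\frac{\bar\kappa}{\rho^2(\Sigma)}\frac{n}{\log d}$.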


Note that inequality (7.32) actually provides a family of upper bounds, one for each valid 
choice of the subset S. The optimal choice of S is based on trading off the two sources of 
error. The first term grows linearly with the cardinality |S|, and corresponds to the error as- 
sociated with estimating a total of |S| unknown coefficients. The second term corresponds 
to approximation error, and depends on the unknown regression vector via the tail sum
$\|\theta^*_{S^c}\|_1 = \sum_{j \notin S}|\theta^*_j|$. An optimal bound is obtained by choosing $S$ to balance these two terms.
We illustrate an application of this type of trade-off in Exercise 7.12. 
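To make the trade-off concrete, the following Python sketch scans over supports consisting of the $k$ largest coefficients and evaluates a bound of the same shape as (7.32): a term linear in $|S|$ plus terms in the tail sum $\|\theta^*_{S^c}\|_1$ and its square. The three coefficients are passed in as plain numbers; the values used below are toy assumptions rather than the constants of the theorem, and `best_support_size` is a hypothetical helper name.

```python
def best_support_size(theta, est_coef, lin_coef, sq_coef, max_size=None):
    """Minimize est_coef*k + lin_coef*tail_k + sq_coef*tail_k^2 over k,
    where tail_k is the l1-norm of all but the k largest entries of theta."""
    mags = sorted((abs(t) for t in theta), reverse=True)
    limit = len(mags) if max_size is None else min(max_size, len(mags))
    best = None
    for k in range(limit + 1):
        tail = sum(mags[k:])
        bound = est_coef * k + lin_coef * tail + sq_coef * tail ** 2
        if best is None or bound < best[1]:
            best = (k, bound)
    return best


# Two large coefficients plus two tiny ones: keeping exactly the large ones is best.
k, bound = best_support_size([5.0, 0.01, 3.0, 0.01], est_coef=1.0, lin_coef=1.0, sq_coef=0.0)
print(k, bound)
```

The cardinality restriction on $S$ in the theorem corresponds to the optional `max_size` argument.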


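Proof Throughout the proof, we use $\rho^2$ as a shorthand for $\rho^2(\Sigma)$. Recall the argument
leading to the bound (7.30). For a general vector $\theta^* \in \mathbb{R}^d$, the same argument applies with
any subset $S$ except that additional terms involving $\|\theta^*_{S^c}\|_1$ must be tracked. Doing so yields
that

\[
0 \le \frac{1}{2n}\|X\hat\Delta\|_2^2 \le \frac{\lambda_n}{2}\big\{3\|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 + 2\|\theta^*_{S^c}\|_1\big\}. \tag{7.33}
\]

This inequality implies that the error vector $\hat\Delta$ satisfies the constraint

\[
\|\hat\Delta\|_1^2 \le \big(4\|\hat\Delta_S\|_1 + 2\|\theta^*_{S^c}\|_1\big)^2 \le 32\,|S|\,\|\hat\Delta\|_2^2 + 8\|\theta^*_{S^c}\|_1^2. \tag{7.34}
\]

Combined with the bound (7.31), we find that

\[
\begin{aligned}
\frac{\|X\hat\Delta\|_2^2}{n} &\ge \Big\{c_1\bar\kappa - 32 c_2\rho^2|S|\frac{\log d}{n}\Big\}\|\hat\Delta\|_2^2 - 8 c_2\rho^2\frac{\log d}{n}\|\theta^*_{S^c}\|_1^2 \\
&\ge c_1\frac{\bar\kappa}{2}\|\hat\Delta\|_2^2 - 8 c_2\rho^2\frac{\log d}{n}\|\theta^*_{S^c}\|_1^2,
\end{aligned} \tag{7.35}
\]

where the final inequality uses the condition $32 c_2\rho^2|S|\frac{\log d}{n} \le c_1\frac{\bar\kappa}{2}$. We split the remainder of
the analysis into two cases.

Case 1: First suppose that $c_1\frac{\bar\kappa}{4}\|\hat\Delta\|_2^2 \ge 8 c_2\rho^2\frac{\log d}{n}\|\theta^*_{S^c}\|_1^2$. Combining the bounds (7.35)
and (7.33) yields

\[
c_1\frac{\bar\kappa}{4}\|\hat\Delta\|_2^2 \le \frac{\lambda_n}{2}\big\{3\sqrt{|S|}\,\|\hat\Delta\|_2 + 2\|\theta^*_{S^c}\|_1\big\}. \tag{7.36}
\]

This bound involves a quadratic form in $\|\hat\Delta\|_2$; computing the zeros of this quadratic form,
we find that

\[
\|\hat\Delta\|_2^2 \le \frac{144\,\lambda_n^2}{c_1^2\bar\kappa^2}|S| + \frac{16\,\lambda_n\|\theta^*_{S^c}\|_1}{c_1\bar\kappa}.
\]

Case 2: Otherwise, we must have $c_1\frac{\bar\kappa}{4}\|\hat\Delta\|_2^2 < 8 c_2\rho^2\frac{\log d}{n}\|\theta^*_{S^c}\|_1^2$.

Taking into account both cases, we combine this bound with the earlier inequality (7.36),
thereby obtaining the claim (7.32).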


7.4 Bounds on prediction error 


In the previous analysis, we have focused exclusively on the problem of parameter recovery, 
either in noiseless or noisy settings. In other applications, the actual value of the regression 
vector $\theta^*$ may not be of primary interest; rather, we might be interested in finding a good
predictor, meaning a vector $\hat\theta \in \mathbb{R}^d$ such that the mean-squared prediction error

\[
\frac{\|X(\hat\theta - \theta^*)\|_2^2}{n} = \frac{1}{n}\sum_{i=1}^n \big(\langle x_i, \hat\theta - \theta^*\rangle\big)^2 \tag{7.37}
\]

is small. To understand why the quantity (7.37) is a measure of prediction error, suppose
that we estimate $\hat\theta$ on the basis of the response vector $y = X\theta^* + w$. Suppose that we then
receive a "fresh" vector of responses, say $\tilde y = X\theta^* + \tilde w$, where $\tilde w \in \mathbb{R}^n$ is a noise vector, with
i.i.d. zero-mean entries with variance $\sigma^2$. We can then measure the quality of our vector $\hat\theta$ by
how well it predicts the vector $\tilde y$ in terms of squared error, taking averages over instantiations
of the noise vector $\tilde w$. Following some algebra, we find that

\[
\frac{1}{n}\mathbb{E}\big[\|\tilde y - X\hat\theta\|_2^2\big] = \frac{1}{n}\|X(\hat\theta - \theta^*)\|_2^2 + \sigma^2,
\]

so that apart from the constant additive factor of $\sigma^2$, the quantity (7.37) measures how well
we can predict a new vector of responses, with the design matrix held fixed.

It is important to note that, at least in general, the problem of finding a good predictor
should be easier than estimating $\theta^*$ well in $\ell_2$-norm. Indeed, the prediction problem does not
require that $\theta^*$ even be identifiable: unlike in parameter recovery, the problem can still be
solved if two columns of the design matrix $X$ are identical.
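The point about identical columns can be seen in a few lines of Python. In the toy design below (assumed values, hypothetical helper names), the first two columns coincide, so two parameter vectors that split the same total weight differently across them give identical predictions even though they are far apart in $\ell_2$-norm.

```python
def predict(X, theta):
    return [sum(x * t for x, t in zip(row, theta)) for row in X]


def mean_sq_prediction_gap(X, theta_a, theta_b):
    """(1/n) * ||X (theta_a - theta_b)||_2^2."""
    pa, pb = predict(X, theta_a), predict(X, theta_b)
    return sum((a - b) ** 2 for a, b in zip(pa, pb)) / len(X)


X = [[1.0, 1.0, 0.5], [2.0, 2.0, -1.0], [0.0, 0.0, 3.0]]  # columns 0 and 1 are identical
theta_a = [1.0, 0.0, 2.0]
theta_b = [0.25, 0.75, 2.0]  # same total weight on the duplicated column

print(mean_sq_prediction_gap(X, theta_a, theta_b))          # prediction gap
print(sum((a - b) ** 2 for a, b in zip(theta_a, theta_b)))  # squared l2 distance
```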


Theorem 7.20 (Prediction error bounds) Consider the Lagrangian Lasso (7.18) with
a strictly positive regularization parameter $\lambda_n \ge 2\|X^T w/n\|_\infty$.
(a) Any optimal solution $\hat\theta$ satisfies the bound

\[
\frac{\|X(\hat\theta - \theta^*)\|_2^2}{n} \le 12\,\|\theta^*\|_1\lambda_n. \tag{7.38}
\]

(b) If $\theta^*$ is supported on a subset $S$ of cardinality $s$, and the design matrix satisfies the
$(\kappa; 3)$-RE condition over $S$, then any optimal solution satisfies the bound

\[
\frac{\|X(\hat\theta - \theta^*)\|_2^2}{n} \le \frac{9}{\kappa}\,s\lambda_n^2. \tag{7.39}
\]


Remarks: As previously discussed in Example 7.14, when the noise vector $w$ has i.i.d. zero-mean
$\sigma$-sub-Gaussian entries and the design matrix is $C$-column normalized, the choice
$\lambda_n = 2C\sigma\big(\sqrt{\frac{2\log d}{n}} + \delta\big)$ is valid with probability at least $1 - 2e^{-n\delta^2/2}$. In this case, Theorem
7.20(a) implies the upper bound

\[
\frac{\|X(\hat\theta - \theta^*)\|_2^2}{n} \le 24\,\|\theta^*\|_1 C\sigma\Big(\sqrt{\frac{2\log d}{n}} + \delta\Big) \tag{7.40}
\]

with the same high probability. For this bound, the requirements on the design matrix are
mild: only the column normalization condition $\max_{j=1,\ldots,d}\frac{\|X_j\|_2}{\sqrt{n}} \le C$ is needed. Thus, the
matrix $X$ could have many identical columns, and this would have no effect on the prediction
error. In fact, when the only constraint on $\theta^*$ is the $\ell_1$-norm bound $\|\theta^*\|_1 \le R$, then the
bound (7.40) is unimprovable; see the bibliographic section for further discussion.
On the other hand, when $\theta^*$ is $s$-sparse and in addition, the design matrix satisfies an RE
condition, then Theorem 7.20(b) guarantees the bound

\[
\frac{\|X(\hat\theta - \theta^*)\|_2^2}{n} \le \frac{72}{\kappa}C^2\sigma^2\Big(\frac{2 s\log d}{n} + \delta^2 s\Big) \tag{7.41}
\]

with the same high probability. This error bound can be significantly smaller than the $\sqrt{\frac{\log d}{n}}$
error bound (7.40) guaranteed under weaker assumptions. For this reason, the bounds (7.38)
and (7.39) are often referred to as the slow rates and fast rates, respectively, for prediction
error. It is natural to question whether or not the RE condition is needed for achieving the
fast rate (7.39); see the bibliography section for discussion of some subtleties surrounding
this issue.
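The gap between the slow and fast rates is easiest to appreciate with numbers. The sketch below evaluates the right-hand sides $12\|\theta^*\|_1\lambda_n$ of (7.38) and $\frac{9}{\kappa}s\lambda_n^2$ of (7.39) for the sub-Gaussian choice $\lambda_n = 2C\sigma(\sqrt{2\log d/n} + \delta)$. All numerical values are toy assumptions, and the function names are hypothetical.

```python
import math


def lasso_lambda(C, sigma, d, n, delta):
    """lambda_n = 2 * C * sigma * (sqrt(2 log d / n) + delta)."""
    return 2.0 * C * sigma * (math.sqrt(2.0 * math.log(d) / n) + delta)


def slow_rate(l1_norm, lam):
    """Right-hand side of (7.38): 12 * ||theta*||_1 * lambda_n."""
    return 12.0 * l1_norm * lam


def fast_rate(s, lam, kappa):
    """Right-hand side of (7.39): (9 / kappa) * s * lambda_n^2."""
    return 9.0 * s * lam ** 2 / kappa


# Assumed toy problem: d = 1000 covariates, s = 5 active, ||theta*||_1 = 5, kappa = 0.5.
for n in (1_000, 10_000, 100_000):
    lam = lasso_lambda(C=1.0, sigma=1.0, d=1000, n=n, delta=0.01)
    print(n, slow_rate(5.0, lam), fast_rate(5, lam, kappa=0.5))
```

Because the slow rate is linear in $\lambda_n$ while the fast rate is quadratic, the fast rate shrinks much more quickly as $n$ grows.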


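Proof Throughout the proof, we adopt the usual notation $\hat\Delta = \hat\theta - \theta^*$ for the error vector.

(a) We first show that $\|\hat\Delta\|_1 \le 4\|\theta^*\|_1$ under the stated conditions. From the Lagrangian
basic inequality (7.29), we have

\[
0 \le \frac{1}{2n}\|X\hat\Delta\|_2^2 \le \frac{w^T X\hat\Delta}{n} + \lambda_n\big\{\|\theta^*\|_1 - \|\hat\theta\|_1\big\}. \tag{7.42}
\]

By Hölder's inequality and our choice of $\lambda_n$, we have

\[
\Big|\frac{w^T X\hat\Delta}{n}\Big| \le \Big\|\frac{X^T w}{n}\Big\|_\infty\|\hat\Delta\|_1 \le \frac{\lambda_n}{2}\big\{\|\theta^*\|_1 + \|\hat\theta\|_1\big\},
\]

where the final step also uses the triangle inequality. Putting together the pieces yields

\[
0 \le \frac{\lambda_n}{2}\big\{\|\theta^*\|_1 + \|\hat\theta\|_1\big\} + \lambda_n\big\{\|\theta^*\|_1 - \|\hat\theta\|_1\big\},
\]

which (for $\lambda_n > 0$) implies that $\|\hat\theta\|_1 \le 3\|\theta^*\|_1$. Consequently, a final application of the
triangle inequality yields $\|\hat\Delta\|_1 \le \|\hat\theta\|_1 + \|\theta^*\|_1 \le 4\|\theta^*\|_1$, as claimed.
We can now complete the proof. Returning to our earlier inequality (7.42), we have

\[
\frac{\|X\hat\Delta\|_2^2}{2n} \le \frac{\lambda_n}{2}\|\hat\Delta\|_1 + \lambda_n\big\{\|\theta^*\|_1 - \|\theta^* + \hat\Delta\|_1\big\} \overset{(i)}{\le} \frac{3\lambda_n}{2}\|\hat\Delta\|_1,
\]

where step (i) is based on the triangle inequality bound $\|\theta^* + \hat\Delta\|_1 \ge \|\theta^*\|_1 - \|\hat\Delta\|_1$. Combined
with the upper bound $\|\hat\Delta\|_1 \le 4\|\theta^*\|_1$, the proof is complete.

(b) In this case, the same argument as in the proof of Theorem 7.13(a) leads to the basic
inequality

\[
\frac{\|X\hat\Delta\|_2^2}{n} \le 3\lambda_n\|\hat\Delta_S\|_1 \le 3\lambda_n\sqrt{s}\,\|\hat\Delta\|_2.
\]

Similarly, the proof of Theorem 7.13(a) shows that the error vector $\hat\Delta$ belongs to $\mathbb{C}_3(S)$,
whence the $(\kappa; 3)$-RE condition can be applied, this time to the right-hand side of the basic
inequality. Doing so yields $\|\hat\Delta\|_2^2 \le \frac{1}{\kappa}\frac{\|X\hat\Delta\|_2^2}{n}$, and hence that $\frac{\|X\hat\Delta\|_2}{\sqrt{n}} \le \frac{3}{\sqrt{\kappa}}\sqrt{s}\,\lambda_n$, as claimed.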


7.5 Variable or subset selection 


Thus far, we have focused on results that guarantee that either the $\ell_2$-error or the prediction
error of the Lasso is small. In other settings, we are interested in a somewhat more refined
question, namely whether or not a Lasso estimate $\hat\theta$ has non-zero entries in the same positions
as the true regression vector $\theta^*$. More precisely, suppose that the true regression vector $\theta^*$ is
$s$-sparse, meaning that it is supported on a subset $S(\theta^*)$ of cardinality $s = |S(\theta^*)|$. In such a
setting, a natural goal is to correctly identify the subset $S(\theta^*)$ of relevant variables. In terms
of the Lasso, we ask the following question: given an optimal Lasso solution $\hat\theta$, when is
its support set, denoted by $S(\hat\theta)$, exactly equal to the true support $S(\theta^*)$? We refer to this
property as variable selection consistency.

Note that it is possible for the $\ell_2$-error $\|\hat\theta - \theta^*\|_2$ to be quite small even if $\hat\theta$ and $\theta^*$ have
different supports, as long as $\hat\theta$ is non-zero for all "suitably large" entries of $\theta^*$, and not too
large in positions where $\theta^*$ is zero. On the other hand, as we discuss in the sequel, given an
estimate $\hat\theta$ that correctly recovers the support of $\theta^*$, we can estimate $\theta^*$ very well (in $\ell_2$-norm,
or other metrics) simply by performing an ordinary least-squares regression restricted to this
subset.


7.5.1 Variable selection consistency for the Lasso 


We begin by addressing the issue of variable selection in the context of deterministic design 
matrices X. (Such a result can be extended to random design matrices, albeit with additional 
effort.) It turns out that variable selection requires some assumptions that are related to but 
distinct from the restricted eigenvalue condition (7.22). In particular, consider the following 
conditions: 


(A3) Lower eigenvalue: The smallest eigenvalue of the sample covariance submatrix indexed
by $S$ is bounded below:

\[
\gamma_{\min}\Big(\frac{X_S^T X_S}{n}\Big) \ge c_{\min} > 0. \tag{7.43a}
\]

(A4) Mutual incoherence: There exists some $\alpha \in [0, 1)$ such that

\[
\max_{j \in S^c}\big\|(X_S^T X_S)^{-1}X_S^T X_j\big\|_1 \le \alpha. \tag{7.43b}
\]


To provide some intuition, the first condition (A3) is very mild: in fact, it would be re- 
quired in order to ensure that the model is identifiable, even if the support set S were known 
a priori. In particular, the submatrix $X_S \in \mathbb{R}^{n \times s}$ corresponds to the subset of covariates that
are in the support set, so that if assumption (A3) were violated, then the submatrix $X_S$ would
have a non-trivial nullspace, leading to a non-identifiable model. Assumption (A4) is a more
subtle condition. In order to gain intuition, suppose that we tried to predict the column vector
$X_j$ using a linear combination of the columns of $X_S$. The best weight vector $\hat\omega \in \mathbb{R}^{|S|}$ is given
by

\[
\hat\omega = \arg\min_{\omega \in \mathbb{R}^{|S|}}\|X_j - X_S\omega\|_2^2 = (X_S^T X_S)^{-1}X_S^T X_j,
\]

and the mutual incoherence condition is a bound on $\|\hat\omega\|_1$. In the ideal case, if the column
space of $X_S$ were orthogonal to $X_j$, then the optimal weight vector $\hat\omega$ would be identically
zero. In general, we cannot expect this orthogonality to hold, but the mutual incoherence
condition (A4) imposes a type of approximate orthogonality.
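The incoherence parameter in (A4) can be computed directly from its regression interpretation: for each column outside the support, solve the least-squares problem of predicting it from the support columns and record the $\ell_1$-norm of the weights. The Python sketch below does this with a small Gaussian-elimination solver; the helper names and the toy design are assumptions made for illustration, and the solver presumes that $X_S^T X_S$ is invertible, as guaranteed by (A3).

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense A)."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):
        x[r] = (M[r][m] - sum(M[r][c] * x[c] for c in range(r + 1, m))) / M[r][r]
    return x


def mutual_incoherence(X, S):
    """max over j not in S of ||(X_S^T X_S)^{-1} X_S^T X_j||_1."""
    n, d = len(X), len(X[0])
    col = lambda j: [X[i][j] for i in range(n)]
    XS = [col(j) for j in S]
    gram = [[sum(a * b for a, b in zip(u, v)) for v in XS] for u in XS]
    worst = 0.0
    for j in range(d):
        if j in S:
            continue
        rhs = [sum(a * b for a, b in zip(u, col(j))) for u in XS]
        omega = solve(gram, rhs)  # least-squares weights for predicting X_j from X_S
        worst = max(worst, sum(abs(t) for t in omega))
    return worst


# Column 2 equals 0.3 * column 0 + 0.2 * column 1 plus a component orthogonal to both.
X = [[1.0, 0.0, 0.3], [0.0, 1.0, 0.2], [0.0, 0.0, 1.0]]
print(mutual_incoherence(X, S=[0, 1]))  # l1-norm of the weights (0.3, 0.2)
```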


With this set-up, the following result applies to the Lagrangian Lasso (7.18) when applied 
to an instance of the linear observation model such that the true parameter $\theta^*$ is supported
on a subset $S$ with cardinality $s$. In order to state the result, we introduce the convenient
shorthand $\Pi_{S^\perp}(X) = I_n - X_S(X_S^T X_S)^{-1}X_S^T$, a type of orthogonal projection matrix.
Theorem 7.21 Consider an $S$-sparse linear regression model for which the design
matrix satisfies conditions (A3) and (A4). Then for any choice of regularization parameter
such that

\[
\lambda_n \ge \frac{2}{1 - \alpha}\Big\|X_{S^c}^T\,\Pi_{S^\perp}(X)\,\frac{w}{n}\Big\|_\infty, \tag{7.44}
\]

the Lagrangian Lasso (7.18) has the following properties:

(a) Uniqueness: There is a unique optimal solution $\hat\theta$.
(b) No false inclusion: This solution has its support set $\hat S$ contained within the true
support set $S$.
(c) $\ell_\infty$-bounds: The error $\hat\theta - \theta^*$ satisfies

\[
\|\hat\theta_S - \theta^*_S\|_\infty \le \underbrace{\Big\|\Big(\frac{X_S^T X_S}{n}\Big)^{-1}X_S^T\frac{w}{n}\Big\|_\infty + \Big|\!\Big|\!\Big|\Big(\frac{X_S^T X_S}{n}\Big)^{-1}\Big|\!\Big|\!\Big|_\infty\lambda_n}_{B(\lambda_n; X)}. \tag{7.45}
\]

(d) No false exclusion: The Lasso includes all indices $i \in S$ such that $|\theta^*_i| > B(\lambda_n; X)$,
and hence is variable selection consistent if $\min_{i \in S}|\theta^*_i| > B(\lambda_n; X)$.


Before proving this result, let us try to interpret its main claims. First, the uniqueness claim 
in part (a) is not trivial in the high-dimensional setting, because, as discussed previously, 
although the Lasso objective is convex, it can never be strictly convex when d > n. Based on 
the uniqueness claim, we can talk unambiguously about the support of the Lasso estimate 
$\hat\theta$. Part (b) guarantees that the Lasso does not falsely include variables that are not in the
support of $\theta^*$, or equivalently that $\hat\theta_{S^c} = 0$, whereas part (d) is a consequence of the sup-norm
bound from part (c): as long as the minimum value of $|\theta^*_i|$ over indices $i \in S$ is not too
small, then the Lasso is variable selection consistent in the full sense.

As with our earlier result (Theorem 7.13) on $\ell_2$-error bounds, Theorem 7.21 is a deterministic
result that applies to any set of linear regression equations. It implies more concrete
results when we make specific assumptions about the noise vector w, as we show here. 


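Corollary 7.22 Consider the $S$-sparse linear model based on a noise vector $w$ with
zero-mean i.i.d. $\sigma$-sub-Gaussian entries, and a deterministic design matrix $X$ that satisfies
assumptions (A3) and (A4), as well as the $C$-column normalization condition
($\max_{j=1,\ldots,d}\|X_j\|_2/\sqrt{n} \le C$). Suppose that we solve the Lagrangian Lasso (7.18) with
regularization parameter

\[
\lambda_n = \frac{2C\sigma}{1 - \alpha}\Big\{\sqrt{\frac{2\log(d - s)}{n}} + \delta\Big\} \tag{7.46}
\]

for some $\delta > 0$. Then the optimal solution $\hat\theta$ is unique with its support contained within
$S$, and satisfies the $\ell_\infty$-error bound

\[
\|\hat\theta_S - \theta^*_S\|_\infty \le \frac{\sigma}{\sqrt{c_{\min}}}\Big\{\sqrt{\frac{2\log s}{n}} + \delta\Big\} + \Big|\!\Big|\!\Big|\Big(\frac{X_S^T X_S}{n}\Big)^{-1}\Big|\!\Big|\!\Big|_\infty\lambda_n,
\]

all with probability at least $1 - 4e^{-n\delta^2/2}$.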
Proof We first verify that the given choice (7.46) of regularization parameter satisfies the
bound (7.44) with high probability. It suffices to bound the maximum absolute value of the
random variables

\[
Z_j := X_j^T\underbrace{\big[I_n - X_S(X_S^T X_S)^{-1}X_S^T\big]}_{\Pi_{S^\perp}(X)}\Big(\frac{w}{n}\Big) \qquad \text{for } j \in S^c. \tag{7.47}
\]

Since $\Pi_{S^\perp}(X)$ is an orthogonal projection matrix, we have

\[
\|\Pi_{S^\perp}(X)X_j\|_2 \le \|X_j\|_2 \overset{(i)}{\le} C\sqrt{n},
\]

where inequality (i) follows from the column normalization assumption. Therefore, each
variable $Z_j$ is sub-Gaussian with parameter at most $C^2\sigma^2/n$. From standard sub-Gaussian
tail bounds (Chapter 2), we have

\[
\mathbb{P}\Big[\max_{j \in S^c}|Z_j| \ge t\Big] \le 2(d - s)\,e^{-\frac{nt^2}{2C^2\sigma^2}},
\]

from which we see that our choice (7.46) of $\lambda_n$ ensures that the bound (7.44) holds with the
claimed probability.

The only remaining step is to simplify the $\ell_\infty$-bound (7.45). The second term in this bound
is a deterministic quantity, so we focus on bounding the first term. For each $i = 1, \ldots, s$,
consider the random variable $\tilde Z_i := e_i^T\big(\frac{1}{n}X_S^T X_S\big)^{-1}X_S^T w/n$. Since the elements of the vector $w$
are i.i.d. $\sigma$-sub-Gaussian, the variable $\tilde Z_i$ is zero-mean and sub-Gaussian with parameter at
most

\[
\frac{\sigma^2}{n}\Big|\!\Big|\!\Big|\Big(\frac{1}{n}X_S^T X_S\Big)^{-1}\Big|\!\Big|\!\Big|_2 \le \frac{\sigma^2}{c_{\min}\,n},
\]

where we have used the eigenvalue condition (7.43a). Consequently, for any $\delta > 0$, we have
$\mathbb{P}\big[\max_{i=1,\ldots,s}|\tilde Z_i| > \frac{\sigma}{\sqrt{c_{\min}}}\big\{\sqrt{\frac{2\log s}{n}} + \delta\big\}\big] \le 2e^{-n\delta^2/2}$, from which the claim follows.


Corollary 7.22 applies to linear models with a fixed matrix X of covariates. An analogous 
result—albeit with a more involved proof—can be proved for Gaussian random covariate 
matrices. Doing so involves showing that a random matrix $X$ drawn from the $\Sigma$-Gaussian
ensemble, with rows sampled i.i.d. from a $N(0, \Sigma)$ distribution, satisfies the $\alpha$-incoherence
condition with high probability (whenever the population matrix $\Sigma$ satisfies this condition,
and the sample size $n$ is sufficiently large). We work through a version of this result in
Exercise 7.19, showing that the incoherence condition holds with high probability with
$n \gtrsim s\log(d - s)$ samples. Figure 7.7 shows that this theoretical prediction is actually sharp, in
that the Lasso undergoes a phase transition as a function of the control parameter $\frac{n}{s\log(d - s)}$.
See the bibliographic section for further discussion of this phenomenon.
Figure 7.7 Thresholds for correct variable selection using the Lasso. (a) Probability
of correct variable selection $\mathbb{P}[\hat S = S]$ versus the raw sample size $n$ for three different
problem sizes $d \in \{128, 256, 512\}$ and square-root sparsity $s = \lceil\sqrt{d}\rceil$. Each point corresponds
to the average of 20 random trials, using a random covariate matrix drawn
from the Toeplitz ensemble of Example 7.17 with $\nu = 0.1$. Note that larger problems
require more samples before the Lasso is able to recover the correct support. (b) The
same simulation results replotted versus the rescaled sample size $\frac{n}{s\log(d - s)}$. Notice
how all three curves are now well aligned, and show a threshold behavior, consistent
with theoretical predictions.
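The rescaling used in panel (b) is a one-line computation. The sketch below, with hypothetical function names, evaluates the control parameter $n/(s\log(d - s))$ for the three problem sizes of the figure with $s = \lceil\sqrt{d}\rceil$, and reports the raw sample size at which this ratio equals one. It does not reproduce the simulation itself, and the location of the empirical threshold is not claimed here.

```python
import math


def rescaled_sample_size(n, d):
    """Control parameter n / (s * log(d - s)) with square-root sparsity s = ceil(sqrt(d))."""
    s = math.ceil(math.sqrt(d))
    return n / (s * math.log(d - s))


def samples_for_ratio(ratio, d):
    """Raw sample size n at which the control parameter equals `ratio`."""
    s = math.ceil(math.sqrt(d))
    return ratio * s * math.log(d - s)


for d in (128, 256, 512):
    print(d, math.ceil(math.sqrt(d)), samples_for_ratio(1.0, d))
```

The same value of the control parameter corresponds to a larger raw sample size for larger $d$, which is why the curves in panel (a) shift to the right as the problem size grows.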


7.5.2 Proof of Theorem 7.21 


We begin by developing the necessary and sufficient conditions for optimality in the Lasso. 
A minor complication arises because the $\ell_1$-norm is not differentiable, due to its sharp point
at the origin. Instead, we need to work in terms of the subdifferential of the $\ell_1$-norm. Given
a convex function $f: \mathbb{R}^d \to \mathbb{R}$, we say that $z \in \mathbb{R}^d$ is a subgradient of $f$ at $\theta$, denoted by
$z \in \partial f(\theta)$, if we have

\[
f(\theta + \Delta) \ge f(\theta) + \langle z, \Delta\rangle \qquad \text{for all } \Delta \in \mathbb{R}^d.
\]

When $f(\theta) = \|\theta\|_1$, it can be seen that $z \in \partial\|\theta\|_1$ if and only if $z_j = \mathrm{sign}(\theta_j)$ for all $j = 1, 2, \ldots, d$.
Here we allow $\mathrm{sign}(0)$ to be any number in the interval $[-1, 1]$. In application
to the Lagrangian Lasso program (7.18), we say that a pair $(\hat\theta, \hat z) \in \mathbb{R}^d \times \mathbb{R}^d$ is primal-dual
optimal if $\hat\theta$ is a minimizer and $\hat z \in \partial\|\hat\theta\|_1$. Any such pair must satisfy the zero-subgradient
condition

\[
\frac{1}{n}X^T(X\hat\theta - y) + \lambda_n\hat z = 0, \tag{7.48}
\]

which is the analog of a zero-gradient condition in the non-differentiable setting.
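For intuition, the zero-subgradient condition can be solved in closed form in one special case. The sketch below assumes an orthonormal design with $X^T X/n = I$ (an assumption made here for illustration, not one used in the text), so that with $u = X^T y/n$ the condition (7.48) reads $\hat\theta - u + \lambda_n\hat z = 0$ coordinate by coordinate. Its solution is soft-thresholding, and the dual vector can be read off and checked against the subdifferential of the $\ell_1$-norm. The function names are hypothetical.

```python
def soft_threshold(u, lam):
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def lasso_orthonormal(u, lam):
    """Primal-dual pair when X^T X / n = I and u = X^T y / n (assumed setting)."""
    theta = [soft_threshold(x, lam) for x in u]
    z = [(x - t) / lam for x, t in zip(u, theta)]  # solves theta - u + lam * z = 0
    return theta, z


def is_l1_subgradient(z, theta, tol=1e-12):
    """z_j = sign(theta_j) where theta_j != 0, and |z_j| <= 1 where theta_j == 0."""
    for zj, tj in zip(z, theta):
        if tj > 0 and abs(zj - 1.0) > tol:
            return False
        if tj < 0 and abs(zj + 1.0) > tol:
            return False
        if tj == 0 and abs(zj) > 1.0 + tol:
            return False
    return True


theta_hat, z_hat = lasso_orthonormal([2.0, -0.3, 0.5, -1.5], lam=0.5)
print(theta_hat, z_hat, is_l1_subgradient(z_hat, theta_hat))
```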

Our proof of Theorem 7.21 is based on a constructive procedure, known as a primal-dual
witness method, which constructs a pair $(\hat\theta, \hat z)$ satisfying the zero-subgradient condition
(7.48), and such that $\hat\theta$ has the correct signed support. When this procedure succeeds,
the constructed pair is primal-dual optimal, and acts as a witness for the fact that the Lasso
has a unique optimal solution with the correct signed support. In more detail, the procedure
consists of the following steps:


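Primal-dual witness (PDW) construction:

1 Set $\hat\theta_{S^c} = 0$.
2 Determine $(\hat\theta_S, \hat z_S) \in \mathbb{R}^s \times \mathbb{R}^s$ by solving the oracle subproblem

\[
\hat\theta_S \in \arg\min_{\theta_S \in \mathbb{R}^s}\Big\{\underbrace{\frac{1}{2n}\|y - X_S\theta_S\|_2^2}_{=:\,f(\theta_S)} + \lambda_n\|\theta_S\|_1\Big\}, \tag{7.49}
\]

and then choosing $\hat z_S \in \partial\|\hat\theta_S\|_1$ such that $\nabla f(\theta_S)\big|_{\theta_S = \hat\theta_S} + \lambda_n\hat z_S = 0$.
3 Solve for $\hat z_{S^c} \in \mathbb{R}^{d-s}$ via the zero-subgradient equation (7.48), and check whether or
not the strict dual feasibility condition $\|\hat z_{S^c}\|_\infty < 1$ holds.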
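The three steps above can be sketched in a few lines in the simplest case of a support of size one, where the oracle subproblem reduces to scalar soft-thresholding. This restriction to $|S| = 1$ is an assumption made here purely to keep the code short; the general construction needs a Lasso solver for the oracle subproblem. The function name and the toy data are hypothetical.

```python
def pdw_single_support(X, y, j0, lam):
    """PDW construction when the candidate support is the single column j0.
    Returns (theta_S, z_S, max |z_j| over j != j0, success flag)."""
    n, d = len(X), len(X[0])
    col = lambda j: [X[i][j] for i in range(n)]
    xs = col(j0)
    a = sum(v * v for v in xs) / n               # X_S^T X_S / n (a scalar here)
    u = sum(v * t for v, t in zip(xs, y)) / n    # X_S^T y / n
    # Step 1 is implicit: all coordinates outside j0 are fixed at zero.
    # Step 2: the scalar oracle subproblem is solved by soft-thresholding.
    shrunk = max(abs(u) - lam, 0.0) * (1.0 if u > 0 else -1.0)
    theta_s = shrunk / a
    resid = [t - v * theta_s for v, t in zip(xs, y)]
    # Step 3: zero-subgradient equation gives z_j = X_j^T (y - X theta) / (n * lam).
    z = [sum(v * r for v, r in zip(col(j), resid)) / (n * lam) for j in range(d)]
    z_off = max(abs(z[j]) for j in range(d) if j != j0)
    return theta_s, z[j0], z_off, z_off < 1.0


# Toy data (assumed): the response is roughly 2 * column 0.
X = [[1.0, 1.0, 0.5], [1.0, -1.0, 0.5], [1.0, 1.0, -0.5], [1.0, -1.0, 0.5]]
y = [2.1, 1.9, 2.0, 2.0]
print(pdw_single_support(X, y, j0=0, lam=0.5))
```

When the returned flag is true, strict dual feasibility holds and the construction succeeds in the sense used in the text.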


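Note that the vector $\hat\theta_{S^c} \in \mathbb{R}^{d-s}$ is determined in step 1, whereas the remaining three
subvectors are determined in steps 2 and 3. By construction, the subvectors $\hat\theta_S$, $\hat z_S$ and $\hat z_{S^c}$
satisfy the zero-subgradient condition (7.48). By using the fact that $\hat\theta_{S^c} = \theta^*_{S^c} = 0$ and writing
out this condition in block matrix form, we obtain

\[
\frac{1}{n}\begin{bmatrix} X_S^T X_S & X_S^T X_{S^c} \\ X_{S^c}^T X_S & X_{S^c}^T X_{S^c} \end{bmatrix}\begin{bmatrix} \hat\theta_S - \theta^*_S \\ 0 \end{bmatrix} - \frac{1}{n}\begin{bmatrix} X_S^T w \\ X_{S^c}^T w \end{bmatrix} + \lambda_n\begin{bmatrix} \hat z_S \\ \hat z_{S^c} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \tag{7.50}
\]

We say that the PDW construction succeeds if the vector $\hat z_{S^c}$ constructed in step 3 satisfies
the strict dual feasibility condition. The following result shows that this success acts as a
witness for the Lasso: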


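Lemma 7.23 If the lower eigenvalue condition (A3) holds, then success of the PDW construction
implies that the vector $(\hat\theta_S, 0) \in \mathbb{R}^d$ is the unique optimal solution of the Lasso.

Proof When the PDW construction succeeds, then $\hat\theta = (\hat\theta_S, 0)$ is an optimal solution with
associated subgradient vector $\hat z \in \mathbb{R}^d$ satisfying $\|\hat z_{S^c}\|_\infty < 1$, and $\langle\hat z, \hat\theta\rangle = \|\hat\theta\|_1$. Now let $\tilde\theta$ be
any other optimal solution. If we introduce the shorthand notation $F(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$, then
we are guaranteed that $F(\hat\theta) + \lambda_n\langle\hat z, \hat\theta\rangle = F(\tilde\theta) + \lambda_n\|\tilde\theta\|_1$, and hence

\[
F(\hat\theta) - \lambda_n\langle\hat z, \tilde\theta - \hat\theta\rangle = F(\tilde\theta) + \lambda_n\big(\|\tilde\theta\|_1 - \langle\hat z, \tilde\theta\rangle\big).
\]

But by the zero-subgradient conditions (7.48), we have $\lambda_n\hat z = -\nabla F(\hat\theta)$, which implies that

\[
F(\hat\theta) + \langle\nabla F(\hat\theta), \tilde\theta - \hat\theta\rangle - F(\tilde\theta) = \lambda_n\big(\|\tilde\theta\|_1 - \langle\hat z, \tilde\theta\rangle\big).
\]

By convexity of $F$, the left-hand side is non-positive, which implies that $\|\tilde\theta\|_1 \le \langle\hat z, \tilde\theta\rangle$. But since
we also have $\langle\hat z, \tilde\theta\rangle \le \|\hat z\|_\infty\|\tilde\theta\|_1$, we must have $\|\tilde\theta\|_1 = \langle\hat z, \tilde\theta\rangle$. Since $\|\hat z_{S^c}\|_\infty < 1$, this equality
can only occur if $\tilde\theta_j = 0$ for all $j \in S^c$.

Thus, all optimal solutions are supported only on $S$, and hence can be obtained by solving
the oracle subproblem (7.49). Given the lower eigenvalue condition (A3), this subproblem
is strictly convex, and so has a unique minimizer.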
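Thus, in order to prove Theorem 7.21(a) and (b), it suffices to show that the vector
$\hat z_{S^c} \in \mathbb{R}^{d-s}$ constructed in step 3 satisfies the strict dual feasibility condition. Using the
zero-subgradient conditions (7.50), we can solve for the vector $\hat z_{S^c} \in \mathbb{R}^{d-s}$, thereby finding
that

\[
\hat z_{S^c} = -\frac{1}{\lambda_n n}X_{S^c}^T X_S\big(\hat\theta_S - \theta^*_S\big) + X_{S^c}^T\Big(\frac{w}{\lambda_n n}\Big). \tag{7.51}
\]

Similarly, using the assumed invertibility of $X_S^T X_S$ in order to solve for the difference $\hat\theta_S - \theta^*_S$
yields

\[
\hat\theta_S - \theta^*_S = (X_S^T X_S)^{-1}X_S^T w - \lambda_n n\,(X_S^T X_S)^{-1}\hat z_S. \tag{7.52}
\]

Substituting this expression back into equation (7.51) and simplifying yields

\[
\hat z_{S^c} = \underbrace{X_{S^c}^T X_S(X_S^T X_S)^{-1}\hat z_S}_{\mu} + \underbrace{X_{S^c}^T\big[I - X_S(X_S^T X_S)^{-1}X_S^T\big]\Big(\frac{w}{\lambda_n n}\Big)}_{V_{S^c}}. \tag{7.53}
\]

By the triangle inequality, we have $\|\hat z_{S^c}\|_\infty \le \|\mu\|_\infty + \|V_{S^c}\|_\infty$. By the mutual incoherence
condition (7.43b), we have $\|\mu\|_\infty \le \alpha$. By our choice (7.44) of regularization parameter, we
have $\|V_{S^c}\|_\infty \le \frac{1}{2}(1 - \alpha)$. Putting together the pieces, we conclude that $\|\hat z_{S^c}\|_\infty \le \frac{1}{2}(1 + \alpha) < 1$,
which establishes the strict dual feasibility condition.

It remains to establish a bound on the $\ell_\infty$-norm of the error $\hat\theta_S - \theta^*_S$. From equation (7.52)
and the triangle inequality, we have

\[
\|\hat\theta_S - \theta^*_S\|_\infty \le \Big\|\Big(\frac{X_S^T X_S}{n}\Big)^{-1}X_S^T\frac{w}{n}\Big\|_\infty + \Big|\!\Big|\!\Big|\Big(\frac{X_S^T X_S}{n}\Big)^{-1}\Big|\!\Big|\!\Big|_\infty\lambda_n, \tag{7.54}
\]

which completes the proof.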


7.6 Appendix: Proof of Theorem 7.16 
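By a rescaling argument, it suffices to restrict attention to vectors belonging to the ellipse
$\mathbb{S}^{d-1}(\Sigma) = \{\theta \in \mathbb{R}^d \mid \|\sqrt{\Sigma}\,\theta\|_2 = 1\}$. Define the function $g(t) := 2\rho(\Sigma)\sqrt{\frac{\log d}{n}}\,t$, and the
associated "bad" event

\[
\mathcal{E} := \Big\{X \in \mathbb{R}^{n \times d} \;\Big|\; \inf_{\theta \in \mathbb{S}^{d-1}(\Sigma)}\frac{\|X\theta\|_2}{\sqrt{n}} \le \frac{1}{4} - 2g(\|\theta\|_1)\Big\}. \tag{7.55}
\]

We first claim that on the complementary set $\mathcal{E}^c$, the lower bound (7.31) holds. Let $\theta \in \mathbb{S}^{d-1}(\Sigma)$
be arbitrary. Defining $a = \frac{1}{4}$, $b = 2g(\|\theta\|_1)$ and $c = \frac{\|X\theta\|_2}{\sqrt{n}}$, we have $c \ge \max\{a - b, 0\}$
on the event $\mathcal{E}^c$. We claim that this lower bound implies that $c^2 \ge (1 - \delta)^2 a^2 - \frac{1}{\delta^2}b^2$ for any
$\delta \in (0, 1)$. Indeed, if $\frac{b}{\delta} \ge a$, then the claimed lower bound is trivial. Otherwise, we may
assume that $b \le \delta a$, in which case the bound $c \ge a - b$ implies that $c \ge (1 - \delta)a$, and hence
that $c^2 \ge (1 - \delta)^2 a^2$. Setting $(1 - \delta)^2 = \frac{1}{2}$ then yields the claim. Thus, the remainder of our
proof is devoted to upper bounding $\mathbb{P}[\mathcal{E}]$.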
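The elementary inequality used in this step can be checked numerically. The sketch below takes the coefficient of $b^2$ to be $1/\delta^2$, which is the reading supported by the case analysis ($b/\delta \ge a$ versus $b \le \delta a$) and should be treated as an assumption. It tests $c^2 \ge (1 - \delta)^2 a^2 - b^2/\delta^2$ at the smallest admissible value $c = \max\{a - b, 0\}$ over a grid; since the left-hand side increases with $c$, this covers all larger $c$ as well. The function names are hypothetical.

```python
def elementary_bound_holds(a, b, delta, tol=1e-12):
    """Check c^2 >= (1 - delta)^2 a^2 - b^2 / delta^2 at c = max(a - b, 0)."""
    c = max(a - b, 0.0)
    return c * c + tol >= (1 - delta) ** 2 * a * a - (b * b) / (delta * delta)


def check_grid(steps=25):
    """Scan a in [0, 1], b in [0, 2] and delta in (0, 1) on a regular grid."""
    for i in range(steps + 1):
        for j in range(steps + 1):
            for k in range(1, steps):
                a, b, delta = i / steps, 2.0 * j / steps, k / steps
                if not elementary_bound_holds(a, b, delta):
                    return False
    return True


print(check_grid())
# The choice (1 - delta)^2 = 1/2 from the text corresponds to delta = 1 - 2 ** -0.5.
print(elementary_bound_holds(0.25, 0.05, 1 - 2 ** -0.5))
```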
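For a pair $(r_\ell, r_u)$ of radii such that $0 \le r_\ell < r_u$, define the sets

\[
\mathbb{K}(r_\ell, r_u) := \big\{\theta \in \mathbb{S}^{d-1}(\Sigma) \mid g(\|\theta\|_1) \in [r_\ell, r_u]\big\}, \tag{7.56a}
\]

along with the events

\[
\mathcal{A}(r_\ell, r_u) := \Big\{\inf_{\theta \in \mathbb{K}(r_\ell, r_u)}\frac{\|X\theta\|_2}{\sqrt{n}} \le \frac{1}{2} - 2r_u\Big\}. \tag{7.56b}
\]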


Given these objects, the following lemma is the central technical result in the proof: 


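Lemma 7.24 For any pair of radii $0 \le r_\ell < r_u$, we have

\[
\mathbb{P}\big[\mathcal{A}(r_\ell, r_u)\big] \le e^{-\frac{n}{32}}\,e^{-\frac{n}{2}r_u^2}. \tag{7.57a}
\]

Moreover, for $\mu = 1/4$, we have

\[
\mathcal{E} \subseteq \mathcal{A}(0, \mu) \cup \Big(\bigcup_{\ell=1}^{\infty}\mathcal{A}(2^{\ell-1}\mu, 2^{\ell}\mu)\Big). \tag{7.57b}
\]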


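Based on this lemma, the remainder of the proof is straightforward. From the inclusion
(7.57b) and the union bound, we have

\[
\mathbb{P}[\mathcal{E}] \le \mathbb{P}[\mathcal{A}(0, \mu)] + \sum_{\ell=1}^{\infty}\mathbb{P}[\mathcal{A}(2^{\ell-1}\mu, 2^{\ell}\mu)] \le e^{-\frac{n}{32}}\Big\{\sum_{\ell=0}^{\infty}e^{-\frac{n}{2}2^{2\ell}\mu^2}\Big\}.
\]

Since $\mu = 1/4$ and $2^{2\ell} \ge 2\ell$, we have

\[
\mathbb{P}[\mathcal{E}] \le e^{-\frac{n}{32}}\sum_{\ell=0}^{\infty}e^{-n\ell\mu^2} \le e^{-\frac{n}{32}}\sum_{\ell=0}^{\infty}\big(e^{-\frac{n}{32}}\big)^{\ell} \le \frac{e^{-n/32}}{1 - e^{-n/32}}.
\]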


It remains to prove the lemma. 


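Proof of Lemma 7.24: We begin with the inclusion (7.57b). Let $\theta \in \mathbb{S}^{d-1}(\Sigma)$ be a vector
that certifies the event $\mathcal{E}$; then it must belong either to the set $\mathbb{K}(0, \mu)$ or to a set $\mathbb{K}(2^{\ell-1}\mu, 2^{\ell}\mu)$
for some $\ell = 1, 2, \ldots$.

Case 1: First suppose that $\theta \in \mathbb{K}(0, \mu)$, so that $g(\|\theta\|_1) \le \mu = 1/4$. Since $\theta$ certifies the event
$\mathcal{E}$, we have

\[
\frac{\|X\theta\|_2}{\sqrt{n}} \le \frac{1}{4} - 2g(\|\theta\|_1) \le \frac{1}{4} = \frac{1}{2} - \mu,
\]

showing that event $\mathcal{A}(0, \mu)$ must happen.

Case 2: Otherwise, we must have $\theta \in \mathbb{K}(2^{\ell-1}\mu, 2^{\ell}\mu)$ for some $\ell = 1, 2, \ldots$, and moreover

\[
\frac{\|X\theta\|_2}{\sqrt{n}} \le \frac{1}{4} - 2g(\|\theta\|_1) \le \frac{1}{4} - 2\big(2^{\ell-1}\mu\big) \le \frac{1}{2} - 2^{\ell}\mu,
\]

which shows that the event $\mathcal{A}(2^{\ell-1}\mu, 2^{\ell}\mu)$ must happen.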
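We now establish the tail bound (7.57a). It is equivalent to upper bound the random variable
$T(r_\ell, r_u) := -\inf_{\theta \in \mathbb{K}(r_\ell, r_u)}\frac{\|X\theta\|_2}{\sqrt{n}}$. By the variational representation of the $\ell_2$-norm, we have

\[
T(r_\ell, r_u) = -\inf_{\theta \in \mathbb{K}(r_\ell, r_u)}\sup_{u \in \mathbb{S}^{n-1}}\frac{\langle u, X\theta\rangle}{\sqrt{n}} = \sup_{\theta \in \mathbb{K}(r_\ell, r_u)}\inf_{u \in \mathbb{S}^{n-1}}\frac{\langle u, X\theta\rangle}{\sqrt{n}}.
\]

Consequently, if we write $X = W\sqrt{\Sigma}$, where $W \in \mathbb{R}^{n \times d}$ is a standard Gaussian matrix and
define the transformed vector $v = \sqrt{\Sigma}\,\theta$, then

\[
-\inf_{\theta \in \mathbb{K}(r_\ell, r_u)}\frac{\|X\theta\|_2}{\sqrt{n}} = \sup_{v \in \tilde{\mathbb{K}}(r_\ell, r_u)}\inf_{u \in \mathbb{S}^{n-1}}\underbrace{\frac{\langle u, Wv\rangle}{\sqrt{n}}}_{Z_{u,v}}, \tag{7.58}
\]

where $\tilde{\mathbb{K}}(r_\ell, r_u) = \{v \in \mathbb{R}^d \mid \|v\|_2 = 1,\ g(\|\Sigma^{-1/2}v\|_1) \in [r_\ell, r_u]\}$.

Since $(u, v)$ range over a subset of $\mathbb{S}^{n-1} \times \mathbb{S}^{d-1}$, each variable $Z_{u,v}$ is zero-mean Gaussian
with variance $n^{-1}$. Furthermore, the Gaussian comparison principle due to Gordon, previously
used in the proof of Theorem 6.1, may be applied. More precisely, we may compare
the Gaussian process $\{Z_{u,v}\}$ to the zero-mean Gaussian process with elements

\[
Y_{u,v} := \frac{\langle g, u\rangle}{\sqrt{n}} + \frac{\langle h, v\rangle}{\sqrt{n}}, \qquad \text{where } g \in \mathbb{R}^n,\ h \in \mathbb{R}^d \text{ have i.i.d. } N(0, 1) \text{ entries.}
\]

Applying Gordon's inequality (6.65), we find that

\[
\begin{aligned}
\mathbb{E}[T(r_\ell, r_u)] = \mathbb{E}\Big[\sup_{v \in \tilde{\mathbb{K}}(r_\ell, r_u)}\inf_{u \in \mathbb{S}^{n-1}}Z_{u,v}\Big] &\le \mathbb{E}\Big[\sup_{v \in \tilde{\mathbb{K}}(r_\ell, r_u)}\inf_{u \in \mathbb{S}^{n-1}}Y_{u,v}\Big] \\
&= \mathbb{E}\Big[\sup_{v \in \tilde{\mathbb{K}}(r_\ell, r_u)}\frac{\langle h, v\rangle}{\sqrt{n}}\Big] + \mathbb{E}\Big[\inf_{u \in \mathbb{S}^{n-1}}\frac{\langle g, u\rangle}{\sqrt{n}}\Big] \\
&= \mathbb{E}\Big[\sup_{\theta \in \mathbb{K}(r_\ell, r_u)}\frac{\langle\sqrt{\Sigma}\,h, \theta\rangle}{\sqrt{n}}\Big] - \mathbb{E}\Big[\frac{\|g\|_2}{\sqrt{n}}\Big].
\end{aligned}
\]

On one hand, we have $\mathbb{E}[\|g\|_2] \ge \sqrt{n}\sqrt{\frac{2}{\pi}}$. On the other hand, applying Hölder's inequality
yields

\[
\mathbb{E}\Big[\sup_{\theta \in \mathbb{K}(r_\ell, r_u)}\frac{\langle\sqrt{\Sigma}\,h, \theta\rangle}{\sqrt{n}}\Big] \le \mathbb{E}\Big[\sup_{\theta \in \mathbb{K}(r_\ell, r_u)}\|\theta\|_1\frac{\|\sqrt{\Sigma}\,h\|_\infty}{\sqrt{n}}\Big] \overset{(i)}{\le} r_u,
\]

where step (i) follows since $\mathbb{E}\big[\frac{\|\sqrt{\Sigma}\,h\|_\infty}{\sqrt{n}}\big] \le 2\rho(\Sigma)\sqrt{\frac{\log d}{n}}$ and $\sup_{\theta \in \mathbb{K}(r_\ell, r_u)}\|\theta\|_1 \le \frac{r_u}{2\rho(\Sigma)\sqrt{(\log d)/n}}$,
by the definition (7.56a) of $\mathbb{K}$. Putting together the pieces, we have shown that

\[
\mathbb{E}[T(r_\ell, r_u)] \le -\sqrt{\frac{2}{\pi}} + r_u. \tag{7.59}
\]

From the representation (7.58), we see that the random variable $\sqrt{n}\,T(r_\ell, r_u)$ is a 1-Lipschitz
function of the standard Gaussian matrix $W$, so that Theorem 2.26 implies the
upper tail bound $\mathbb{P}[T(r_\ell, r_u) \ge \mathbb{E}[T(r_\ell, r_u)] + \delta] \le e^{-n\delta^2/2}$ for all $\delta \ge 0$. Define the constant
$C = \sqrt{\frac{2}{\pi}} - \frac{1}{2} \ge \frac{1}{4}$. Setting $\delta = C + r_u$ and using our upper bound on the mean (7.59) yields

\[
\mathbb{P}\Big[T(r_\ell, r_u) \ge -\frac{1}{2} + 2r_u\Big] \le e^{-\frac{n}{2}C^2}\,e^{-\frac{n}{2}r_u^2} \le e^{-\frac{n}{32}}\,e^{-\frac{n}{2}r_u^2},
\]

as claimed.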


7.7 Bibliographic details and background 


The Gaussian sequence model discussed briefly in Example 7.1 has been the subject of in- 
tensive study. Among other reasons, it is of interest because many nonparametric estimation 
problems can be “reduced” to equivalent versions in the (infinite-dimensional) normal se- 
quence model. The book by Johnstone (2015) provides a comprehensive introduction; see 
also the references therein. Donoho and Johnstone (1994) derive sharp upper and lower 
bounds on the minimax risk in $\ell_p$-norm for a vector belonging to an $\ell_q$-ball, $q \in [0, 1]$, for
the case of the Gaussian sequence model. The problem of bounding the in-sample prediction 
error for nonparametric least squares, as studied in Chapter 13, can also be understood as a 
special case of the Gaussian sequence model. 

The use of $\ell_1$-regularization for ill-posed inverse problems has a lengthy history, with
early work in geophysics (e.g., Levy and Fullagar, 1981; Oldenburg et al., 1983; Santosa
and Symes, 1986); see Donoho and Stark (1989) for further discussion. Alliney and Ruzinsky
(1994) studied various algorithmic issues associated with $\ell_1$-regularization, which soon
became the subject of more intensive study in statistics and applied mathematics following 
the seminal papers of Chen, Donoho and Saunders (1998) on the basis pursuit program (7.9), 
and Tibshirani (1996) on the Lasso (7.18). Other authors have also studied various forms of 
non-convex regularization for enforcing sparsity; for instance, see the papers (Fan and Li, 
2001; Zou and Li, 2008; Fan and Lv, 2011; Zhang, 2012; Zhang and Zhang, 2012; Loh and 
Wainwright, 2013; Fan et al., 2014) and references therein. 

Early work on the basis pursuit linear program (7.9) focused on the problem of repre- 
senting a signal in a pair of bases, in which n is the signal length, and p = 2n indexes the 
union of the two bases of $\mathbb{R}^n$. The incoherence condition arose from this line of work (e.g.,
Donoho and Huo, 2001; Elad and Bruckstein, 2002); the necessary and sufficient condi- 
tions that constitute the restricted nullspace property seem to have been isolated for the first 
time by Feuer and Nemirovski (2003). However, the terminology and precise definition of 
restricted nullspace used here was given by Cohen et al. (2008). 

Juditsky and Nemirovski (2000), Nemirovski (2000) and Greenshtein and Ritov (2004)
were early authors to provide some high-dimensional guarantees for estimators based on
$\ell_1$-regularization, in particular in the context of function aggregation problems. Candès and
Tao (2005) and Donoho (2006a; 2006b) analyzed the basis pursuit method for the case of
random Gaussian or unitary matrices, and showed that it can succeed with $n \gtrsim s \log(ed/s)$
samples. Donoho and Tanner (2008) provided a sharp analysis of this threshold phenomenon
in the noiseless case, with connections to the structure of random polytopes. The restricted
isometry property was introduced by Candès and Tao (2005; 2007). They also proposed
the Dantzig selector, an alternative $\ell_1$-based relaxation closely related to the Lasso, and
proved bounds on noisy recovery for ensembles that satisfy the RIP condition. Bickel et
al. (2009) introduced the weaker restricted eigenvalue (RE) condition, slightly different from
but essentially equivalent to the version stated here, and provided a unified way to derive
$\ell_2$-error and prediction error bounds for both the Lasso and the Dantzig selector. Exercises 7.13
and 7.14 show how to derive $\ell_\infty$-bounds on the Lasso error by using $\ell_\infty$-analogs of the
$\ell_2$-restricted eigenvalues; see Ye and Zhang (2010) for bounds on the Lasso and Dantzig errors
using these and other types of restricted eigenvalues. Van de Geer and Bühlmann (2009)
provide a comprehensive overview of different types of RE conditions, and the relationships
among them; see also their book (Bühlmann and van de Geer, 2011).

The proof of Theorem 7.13(a) is inspired by the proof technique of Bickel et al. (2009); 
see also the material in Chapter 9, and the paper by Negahban et al. (2012) for a general 
viewpoint on regularized M-estimators. There are many variants and extensions of the basic 
Lasso, including the square-root Lasso (Belloni et al., 2011), the elastic net (Zou and Hastie, 
2005), the fused Lasso (Tibshirani et al., 2005), the adaptive Lasso (Zou, 2006; Huang et al., 
2008) and the group Lasso (Yuan and Lin, 2006). See Exercise 7.17 in this chapter for 
discussion of the square-root Lasso, and Chapter 9 for discussion of some of these other 
extensions. 

Theorem 7.16 was proved by Raskutti et al. (2010). Rudelson and Zhou (2013) prove an 
analogous result for more general ensembles of sub-Gaussian random matrices; this analysis 
requires substantially different techniques, since Gaussian comparison results are no longer 
available. Both of these results apply to a very broad class of random matrices; for instance, 
it is even possible to sample the rows of the random matrix $X \in \mathbb{R}^{n \times d}$ from a distribution
with a degenerate covariance matrix, and/or with its maximum eigenvalue diverging with 
the problem size, and these results can still be applied to show that a (lower) restricted 
eigenvalue condition holds with high probability. Exercise 7.10 is based on results of Loh 
and Wainwright (2012). 

Exercise 7.12 explores the $\ell_2$-error rates achievable by the Lasso for vectors that belong
to an $\ell_q$-ball. These results are known to be minimax-optimal, as can be shown using
information-theoretic techniques for lower bounding the minimax rate. See Chapter 15
for details on techniques for proving lower bounds, and the papers (Ye and Zhang, 2010; 
Raskutti et al., 2011) for specific lower bounds in the context of sparse linear regression. 

The slow and fast rates for prediction (that is, the bounds in equations (7.40) and
(7.41), respectively) have been derived in various papers (e.g., Bunea et al., 2007; Candès
and Tao, 2007; Bickel et al., 2009). It is natural to wonder whether the restricted eigenvalue
conditions, which control correlation between the columns of the design matrix, should be
required for achieving the fast rate. From a fundamental point of view, such conditions are
not necessary: an $\ell_0$-based estimator, one that performs an exhaustive search over all $\binom{d}{s}$
subsets of size s, can achieve the fast rate with only a column normalization condition on
the design matrix (Bunea et al., 2007; Raskutti et al., 2011); see Example 13.16 for an
explicit derivation of the fast bound for this method. It can be shown that the Lasso itself
is sub-optimal: a number of authors (Foygel and Srebro, 2011; Dalalyan et al., 2014) have
given design matrices X and 2-sparse vectors for which the Lasso squared prediction error
is lower bounded as $1/\sqrt{n}$. Zhang et al. (2017) construct a harder design matrix for which
the $\ell_0$-based method can achieve the fast rate, but for which a broad class of M-estimators,
one that includes the Lasso as well as estimators based on non-convex regularizers, has
prediction error lower bounded as $1/\sqrt{n}$. If, in addition, we restrict attention to methods
that are required to output an s-sparse estimator, then Zhang et al. (2014) show that, under
a standard conjecture in complexity theory, no polynomial-time algorithm can achieve the
fast rate (7.41) without the lower RE condition.

Irrepresentable conditions for variable selection consistency were introduced independently
by Fuchs (2004) and Tropp (2006) in signal processing, and by Meinshausen and
Bühlmann (2006) and Zhao and Yu (2006) in statistics. The primal-dual witness proof of
Theorem 7.21 follows the argument of Wainwright (2009b); see also this paper for extensions
to general random Gaussian designs. The proof of Lemma 7.23 was suggested by
Caramanis (personal communication, 2010). The primal-dual witness method that underlies
the proof of Theorem 7.21 has been applied in a variety of other settings, including analysis
of group Lasso (Obozinski et al., 2011; Wang et al., 2015) and related relaxations (Jalali
et al., 2010; Negahban and Wainwright, 2011b), graphical Lasso (Ravikumar et al., 2011),
methods for Gaussian graph selection with hidden variables (Chandrasekaran et al., 2012b),
and variable selection in nonparametric models (Xu et al., 2014). Lee et al. (2013) describe
a general framework for deriving consistency results using the primal-dual witness method.

The results in this chapter were based on theoretically derived choices of the regularization
parameter $\lambda_n$, all of which involved the (unknown) standard deviation $\sigma$ of the additive
noise. One way to circumvent this difficulty is by using the square-root Lasso estimator (Belloni
et al., 2011), for which the optimal choice of regularization parameter does not depend
on $\sigma$. See Exercise 7.17 for a description and analysis of this estimator.


7.8 Exercises 


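Exercise 7.1 (Optimization and threshold estimators)

(a) Show that the hard-thresholding estimator (7.6a) corresponds to the optimal solution $\hat{\theta}$
of the non-convex program

$$\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2} \|y - \theta\|_2^2 + \frac{1}{2} \lambda^2 \|\theta\|_0 \Big\}.$$

(b) Show that the soft-thresholding estimator (7.6b) corresponds to the optimal solution $\hat{\theta}$
of the $\ell_1$-regularized quadratic program

$$\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2} \|y - \theta\|_2^2 + \lambda \|\theta\|_1 \Big\}.$$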
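
Both programs in Exercise 7.1 separate across coordinates, so the claims can be sanity-checked numerically one coordinate at a time. The sketch below is only an illustration, not a proof: it assumes the usual forms of the two estimators (hard thresholding keeps $y$ when $|y| > \lambda$ and returns 0 otherwise; soft thresholding returns $\mathrm{sign}(y)(|y| - \lambda)_+$), and the function names and the brute-force grid are our own choices.

```python
import math

def hard_threshold(y, lam):
    """Hard thresholding (assumed form): keep y if |y| > lam, else return 0."""
    return y if abs(y) > lam else 0.0

def soft_threshold(y, lam):
    """Soft thresholding (assumed form): shrink |y| by lam, clipping at 0."""
    return math.copysign(max(abs(y) - lam, 0.0), y)

def l0_objective(theta, y, lam):
    """Scalar objective 0.5*(y - theta)^2 + 0.5*lam^2 * 1[theta != 0]."""
    return 0.5 * (y - theta) ** 2 + 0.5 * lam ** 2 * (theta != 0.0)

def l1_objective(theta, y, lam):
    """Scalar objective 0.5*(y - theta)^2 + lam*|theta|."""
    return 0.5 * (y - theta) ** 2 + lam * abs(theta)

def grid_argmin(objective, y, lam, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force minimizer over a uniform grid that contains 0 exactly."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda t: objective(t, y, lam))

if __name__ == "__main__":
    lam = 1.0
    for y in [-3.0, -0.7, 0.2, 0.9, 2.5]:
        print(y,
              hard_threshold(y, lam), grid_argmin(l0_objective, y, lam),
              soft_threshold(y, lam), grid_argmin(l1_objective, y, lam))
```

The grid search agrees with the closed-form thresholding rules up to the grid resolution; observations with $|y| = \lambda$ are avoided because the non-convex objective then has two minimizers.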


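Exercise 7.2 (Properties of $\ell_q$-balls) For a given $q \in (0, 1]$, recall the (strong) $\ell_q$-ball

$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^d |\theta_j|^q \leq R_q \Big\}. \tag{7.60}$$

The weak $\ell_q$-ball with parameters $(C, \alpha)$ is defined as

$$\mathbb{B}_{w(\alpha)}(C) := \big\{ \theta \in \mathbb{R}^d \mid |\theta|_{(j)} \leq C j^{-\alpha} \text{ for } j = 1, \ldots, d \big\}. \tag{7.61}$$

Here $|\theta|_{(j)}$ denote the order statistics of $\theta$ in absolute value, ordered from largest to smallest
(so that $|\theta|_{(1)} = \max_j |\theta_j|$ and $|\theta|_{(d)} = \min_j |\theta_j|$).

(a) Show that the set $\mathbb{B}_q(R_q)$ is star-shaped around the origin. (A set $C \subseteq \mathbb{R}^d$ is star-shaped
around the origin if $\theta \in C \Rightarrow t\theta \in C$ for all $t \in [0, 1]$.)

(b) For any $\alpha > 1/q$, show that there is a radius $R_q$ depending on $(C, \alpha)$ such that
$\mathbb{B}_{w(\alpha)}(C) \subseteq \mathbb{B}_q(R_q)$. This inclusion underlies the terminology “strong” and “weak”, respectively.

(c) For a given integer $s \in \{1, 2, \ldots, d\}$, the best s-term approximation to a vector $\theta^* \in \mathbb{R}^d$ is
given by

$$\Pi_s(\theta^*) := \arg\min_{\|\theta\|_0 \leq s} \|\theta - \theta^*\|_2^2. \tag{7.62}$$

Give a closed-form expression for $\Pi_s(\theta^*)$.

(d) When $\theta^* \in \mathbb{B}_q(R_q)$ for some $q \in (0, 1]$, show that the best s-term approximation satisfies

$$\|\Pi_s(\theta^*) - \theta^*\|_2^2 \leq (R_q)^{2/q} \Big( \frac{1}{s} \Big)^{\frac{2}{q} - 1}. \tag{7.63}$$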
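
The bound (7.63) is easy to probe numerically. The sketch below assumes the standard fact that a best s-term approximation is obtained by keeping the s entries of largest magnitude and zeroing the rest; the test vector with polynomially decaying entries, and all function names, are hypothetical choices made only for illustration.

```python
def best_s_term(theta, s):
    """Keep the s largest-magnitude entries of theta and zero out the rest."""
    order = sorted(range(len(theta)), key=lambda j: abs(theta[j]), reverse=True)
    keep = set(order[:s])
    return [t if j in keep else 0.0 for j, t in enumerate(theta)]

def lq_radius(theta, q):
    """Smallest R_q for which theta lies in the strong l_q-ball B_q(R_q)."""
    return sum(abs(t) ** q for t in theta)

def approx_error_sq(theta, s):
    """Squared l_2 error of the s-term approximation computed above."""
    approx = best_s_term(theta, s)
    return sum((a - t) ** 2 for a, t in zip(approx, theta))

def bound_7_63(theta, s, q):
    """Right-hand side of (7.63): R_q^(2/q) * (1/s)^(2/q - 1)."""
    return lq_radius(theta, q) ** (2.0 / q) * (1.0 / s) ** (2.0 / q - 1.0)

if __name__ == "__main__":
    # Hypothetical test vector with alternating signs and decay 1/j^2.
    theta = [(-1) ** j / (j + 1) ** 2 for j in range(50)]
    for q in (0.5, 1.0):
        for s in (1, 5, 20):
            print(q, s, approx_error_sq(theta, s), bound_7_63(theta, s, q))
```

Taking $R_q$ to be the exact value of $\sum_j |\theta_j|^q$ gives the tightest version of the right-hand side for the vector at hand.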


Exercise 7.3 (Pairwise incoherence) Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose that it has pairwise
incoherence (7.12) upper bounded as $\delta_{PW}(X) < \frac{\gamma}{s}$.

(a) Let $S \subset \{1, 2, \ldots, d\}$ be any subset of size s. Show that there is a function $\gamma \mapsto c(\gamma)$ such
that $\gamma_{\min}\big(\frac{X_S^T X_S}{n}\big) \geq c(\gamma) > 0$, as long as $\gamma$ is sufficiently small.

(b) Prove that X satisfies the restricted nullspace property with respect to S as long as
$\gamma < 1/3$. (Do this from first principles, without using any results on restricted isometry.)


Exercise 7.4 (RIP and pairwise incoherence) In this exercise, we explore the relation be- 
tween the pairwise incoherence and RIP constants. 


(a) Prove the sandwich relation (7.15) for the pairwise incoherence and RIP constants. Give 
a matrix for which inequality (i) is tight, and another matrix for which inequality (ii) is 
tight. 

(b) Construct a matrix such that $\delta_s(X) = \sqrt{s}\, \delta_{PW}(X)$.


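Exercise 7.5 ($\ell_2$-RE $\Rightarrow$ $\ell_1$-RE) Let $S \subset \{1, 2, \ldots, d\}$ be a subset of cardinality s. A matrix
$X \in \mathbb{R}^{n \times d}$ satisfies an $\ell_1$-RE condition over S with parameters $(\gamma_1, \alpha_1)$ if

$$\frac{\|X\theta\|_2^2}{n} \geq \gamma_1 \frac{\|\theta\|_1^2}{s} \qquad \text{for all } \theta \in \mathbb{C}(S; \alpha_1).$$

Show that any matrix satisfying the $\ell_2$-RE condition (7.22) with parameters $(\gamma_2, \alpha_2)$ satisfies
the $\ell_1$-RE condition with parameters $\gamma_1 = \frac{\gamma_2}{(1 + \alpha_2)^2}$ and $\alpha_1 = \alpha_2$.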

Exercise 7.6 (Weighted $\ell_1$-norms) In many applications, one has additional information
about the relative scalings of different predictors, so that it is natural to use a weighted
$\ell_1$-norm, of the form $\|\theta\|_{\omega(1)} := \sum_{j=1}^d \omega_j |\theta_j|$, where $\omega \in \mathbb{R}^d$ is a vector of strictly positive weights.
In the case of noiseless observations, this leads to the weighted basis pursuit LP

$$\min_{\theta \in \mathbb{R}^d} \|\theta\|_{\omega(1)} \qquad \text{such that } X\theta = y.$$

(a) State and prove necessary and sufficient conditions on X for the weighted basis pursuit
LP to (uniquely) recover all k-sparse vectors $\theta^*$.

(b) Suppose that $\theta^*$ is supported on a subset S of cardinality s, and the weight vector $\omega$
satisfies

$$\omega_j = \begin{cases} 1 & \text{if } j \in S, \\ t & \text{otherwise,} \end{cases}$$

for some $t \geq 1$. State and prove a sufficient condition for recovery in terms of
$c_{\min} = \gamma_{\min}(X_S^T X_S / n)$, the pairwise incoherence $\delta_{PW}(X)$ and the scalar t. How do the conditions
on X behave as $t \to +\infty$?


Exercise 7.7 (Pairwise incoherence and RIP for isotropic ensembles) Consider a random
matrix $X \in \mathbb{R}^{n \times d}$ with i.i.d. $N(0, 1)$ entries.

(a) For a given $s \in \{1, 2, \ldots, d\}$, suppose that $n \gtrsim s^2 \log d$. Show that the pairwise incoherence
satisfies the bound $\delta_{PW}(X) < \frac{1}{3s}$ with high probability.

(b) Now suppose that $n \gtrsim s \log\big(\frac{ed}{s}\big)$. Show that the RIP constant satisfies the bound $\delta_{2s} < 1/3$
with high probability.


Exercise 7.8 (Violations of pairwise incoherence and RIP) Recall the ensemble of spiked
identity covariance matrices from Example 7.18 with a constant $\mu > 0$, and consider an
arbitrary sparsity level $s \in \{1, 2, \ldots, d\}$.

(a) Violation of pairwise incoherence: show that

$$\mathbb{P}\big[\delta_{PW}(X) > \mu - 3\delta\big] \geq 1 - 6e^{-n\delta^2/8} \qquad \text{for all } \delta \in (0, 1/\sqrt{2}).$$

Consequently, a pairwise incoherence condition cannot hold unless $\mu \ll \frac{1}{s}$.

(b) Violation of RIP: show that

$$\mathbb{P}\big[\delta_{2s}(X) \geq (1 + (\sqrt{2s} - 1)\mu)\delta\big] \geq 1 - e^{-n\delta^2/8} \qquad \text{for all } \delta \in (0, 1).$$

Consequently, a RIP condition cannot hold unless $\mu \ll \frac{1}{\sqrt{s}}$.


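Exercise 7.9 (Relations between $\ell_0$ and $\ell_1$ constraints) For an integer $k \in \{1, \ldots, d\}$,
consider the following two subsets:

$$\mathbb{L}_0(k) := \mathbb{B}_2(1) \cap \mathbb{B}_0(k) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \leq 1 \text{ and } \|\theta\|_0 \leq k\},$$
$$\mathbb{L}_1(k) := \mathbb{B}_2(1) \cap \mathbb{B}_1(\sqrt{k}) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \leq 1 \text{ and } \|\theta\|_1 \leq \sqrt{k}\}.$$

For any set $\mathbb{L}$, let $\mathrm{conv}(\mathbb{L})$ denote the closure of its convex hull.

(a) Prove that $\mathrm{conv}(\mathbb{L}_0(k)) \subseteq \mathbb{L}_1(k)$.
(b) Prove that $\mathbb{L}_1(k) \subseteq 2\, \mathrm{conv}(\mathbb{L}_0(k))$.

(Hint: For part (b), you may find it useful to consider the support functions of the two sets.)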


Exercise 7.10 (Sufficient conditions for RE) Consider an arbitrary symmetric matrix $\Gamma$ for
which there is a scalar $\delta > 0$ such that

$$|\theta^T \Gamma \theta| \leq \delta \qquad \text{for all } \theta \in \mathbb{L}_0(2s),$$

where the set $\mathbb{L}_0$ was defined in Exercise 7.9.

(a) Show that

$$|\theta^T \Gamma \theta| \leq \begin{cases} 12\delta \|\theta\|_2^2 & \text{for all vectors such that } \|\theta\|_1 \leq \sqrt{s}\, \|\theta\|_2, \\ \frac{12\delta}{s} \|\theta\|_1^2 & \text{otherwise.} \end{cases}$$

(Hint: Part (b) of Exercise 7.9 could be useful.)
(b) Use part (a) to show that RIP implies the RE condition.
(c) Give an example of a matrix family that violates RIP for which part (a) can be used to
guarantee the RE condition.


Exercise 7.11 (Weaker sufficient conditions for RE) Consider a covariance matrix $\Sigma$ with
minimum eigenvalue $\gamma_{\min}(\Sigma) > 0$ and maximum variance $\rho^2(\Sigma)$.

(a) Show that the lower bound (7.31) implies that the RE condition (7.22) holds with parameter
$\kappa = \frac{c_1}{2} \gamma_{\min}(\Sigma)$ over $\mathbb{C}_\alpha(S)$, uniformly for all subsets S of cardinality at most

$$|S| \leq \frac{c_1}{2 c_2} \frac{\gamma_{\min}(\Sigma)}{\rho^2(\Sigma)} (1 + \alpha)^{-2} \frac{n}{\log d}.$$

(b) Give a sequence of covariance matrices $\{\Sigma^{(d)}\}$ for which $\gamma_{\max}(\Sigma^{(d)})$ diverges, but part (a)
can still be used to guarantee the RE condition.


Exercise 7.12 (Estimation over $\ell_q$-“balls”) In this problem, we consider linear regression
with a vector $\theta^* \in \mathbb{B}_q(R_q)$ for some radius $R_q \geq 1$ and parameter $q \in (0, 1]$ under the
following conditions: (a) the design matrix X satisfies the lower bound (7.31) and uniformly
bounded columns ($\|X_j\|_2 / \sqrt{n} \leq 1$ for all $j = 1, \ldots, d$); (b) the noise vector $w \in \mathbb{R}^n$ has
i.i.d. zero-mean entries that are sub-Gaussian with parameter $\sigma$.

Using Theorem 7.19 and under an appropriate lower bound on the sample size n in terms
of $(d, R_q, \sigma, q)$, show that there are universal constants $(c_0, c_1, c_2)$ such that, with probability
$1 - c_1 e^{-c_2 \log d}$, any Lasso solution $\hat{\theta}$ satisfies the bound

$$\|\hat{\theta} - \theta^*\|_2^2 \leq c_0 R_q \Big( \frac{\sigma^2 \log d}{n} \Big)^{1 - \frac{q}{2}}.$$

(Note: The universal constants can depend on quantities related to $\Sigma$, as in the bound (7.31).)


Exercise 7.13 ($\ell_\infty$-bounds for the Lasso) Consider the sparse linear regression model
$y = X\theta^* + w$, where $w \sim N(0, \sigma^2 I_{n \times n})$ and $\theta^* \in \mathbb{R}^d$ is supported on a subset S. Suppose that
the sample covariance matrix $\hat{\Sigma} = \frac{1}{n} X^T X$ has its diagonal entries uniformly upper bounded
by one, and that for some parameter $\gamma > 0$, it also satisfies an $\ell_\infty$-curvature condition of the
form

$$\|\hat{\Sigma} \Delta\|_\infty \geq \gamma \|\Delta\|_\infty \qquad \text{for all } \Delta \in \mathbb{C}_3(S). \tag{7.64}$$

Show that with the regularization parameter $\lambda_n = 4\sigma \sqrt{\frac{\log d}{n}}$, any Lasso solution satisfies the
$\ell_\infty$-bound

$$\|\hat{\theta} - \theta^*\|_\infty \leq \frac{6\sigma}{\gamma} \sqrt{\frac{\log d}{n}}$$

with high probability.

Exercise 7.14 (Verifying $\ell_\infty$-curvature conditions) This problem is a continuation of
Exercise 7.13. Suppose that we form a random design matrix $X \in \mathbb{R}^{n \times d}$ with rows drawn i.i.d.
from a $N(0, \Sigma)$ distribution, and moreover that

$$\|\Sigma \Delta\|_\infty \geq \gamma \|\Delta\|_\infty \qquad \text{for all vectors } \Delta \in \mathbb{C}_3(S).$$

Show that, with high probability, the sample covariance $\hat{\Sigma} := \frac{1}{n} X^T X$ satisfies this same
property with $\gamma/2$ as long as $n \gtrsim s^2 \log d$.


Exercise 7.15 (Sharper bounds for Lasso) Let $X \in \mathbb{R}^{n \times d}$ be a fixed design matrix such that
$\frac{|||X_S|||_2}{\sqrt{n}} \leq C$ for all subsets S of cardinality at most s. In this exercise, we show that, with high
probability, any solution of the constrained Lasso (7.19) with $R = \|\theta^*\|_1$ satisfies the bound

$$\|\hat{\theta} - \theta^*\|_2 \lesssim \frac{\sigma}{\kappa} \sqrt{\frac{s \log(ed/s)}{n}}, \tag{7.65}$$

where $s = \|\theta^*\|_0$. Note that this bound provides an improvement for linear sparsity (i.e.,
whenever $s = \alpha d$ for some constant $\alpha \in (0, 1)$).

(a) Define the random variable

$$Z := \sup_{\theta \in \mathbb{R}^d} \Big| \Big\langle \theta, \frac{1}{n} X^T w \Big\rangle \Big| \qquad \text{such that } \|\theta\|_2 \leq 1 \text{ and } \|\theta\|_1 \leq \sqrt{s}, \tag{7.66}$$

where $w \sim N(0, \sigma^2 I)$. Show that

$$\mathbb{P}\Big[ \frac{Z}{C\sigma} \geq c_1 \sqrt{\frac{s \log(ed/s)}{n}} + \delta \Big] \leq c_2 e^{-c_3 n \delta^2}$$

for universal constants $(c_1, c_2, c_3)$. (Hint: The result of Exercise 7.9 may be useful here.)

(b) Use part (a) and results from the chapter to show that if X satisfies an RE condition, then
any optimal Lasso solution $\hat{\theta}$ satisfies the bound (7.65) with probability $1 - c_2' e^{-c_3' s \log(ed/s)}$.


Exercise 7.16 (Analysis of weighted Lasso) In this exercise, we analyze the weighted
Lasso estimator

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\nu(1)} \Big\},$$

where $\|\theta\|_{\nu(1)} := \sum_{j=1}^d \nu_j |\theta_j|$ denotes the weighted $\ell_1$-norm defined by a positive weight vector
$\nu \in \mathbb{R}^d$. Define $C_j = \frac{\|X_j\|_2}{\sqrt{n}}$, where $X_j \in \mathbb{R}^n$ denotes the jth column of the design matrix, and
let $\hat{\Delta} = \hat{\theta} - \theta^*$ be the error vector associated with an optimal solution $\hat{\theta}$.

(a) Suppose that we choose a regularization parameter $\lambda_n \geq 2 \max_{j=1,\ldots,d} \frac{|\langle X_j, w \rangle|}{n \nu_j}$. Show that the
vector $\hat{\Delta}$ belongs to the modified cone set

$$\mathbb{C}_3(S; \nu) := \{\Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_{\nu(1)} \leq 3 \|\Delta_S\|_{\nu(1)}\}. \tag{7.67}$$

(b) Assuming that X satisfies a $\kappa$-RE condition over $\mathbb{C}_\nu(S; 3)$, show that

$$\|\hat{\theta} - \theta^*\|_2 \leq \frac{6}{\kappa} \lambda_n \sqrt{\sum_{j \in S} \nu_j^2}.$$

(c) For a general design matrix, the rescaled column norms $C_j = \|X_j\|_2 / \sqrt{n}$ may vary widely.
Give a choice of weights for which the weighted Lasso error bound is superior to the
ordinary Lasso bound. (Hint: You should be able to show an improvement by a factor of
$\frac{\max_{j \in S} C_j}{\max_{j=1,\ldots,d} C_j}$.)

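Exercise 7.17 (Analysis of square-root Lasso) The square-root Lasso is given by

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{\sqrt{n}} \|y - X\theta\|_2 + \gamma_n \|\theta\|_1 \Big\}.$$

(a) Suppose that the regularization parameter $\gamma_n$ is varied over the interval $(0, \infty)$. Show
that the resulting set of solutions coincides with those of the Lagrangian Lasso as $\lambda_n$ is
varied.

(b) Show that any square-root Lasso estimate $\hat{\theta}$ satisfies the equality

$$\frac{\frac{1}{n} X^T (X\hat{\theta} - y)}{\frac{1}{\sqrt{n}} \|y - X\hat{\theta}\|_2} + \gamma_n \hat{z} = 0,$$

where $\hat{z} \in \mathbb{R}^d$ belongs to the subdifferential of the $\ell_1$-norm at $\hat{\theta}$.

(c) Suppose $y = X\theta^* + w$ where the unknown regression vector $\theta^*$ is S-sparse. Use part (b)
to establish that the error $\hat{\Delta} = \hat{\theta} - \theta^*$ satisfies the basic inequality

$$\frac{1}{n} \|X\hat{\Delta}\|_2^2 \leq \Big\langle \hat{\Delta}, \frac{1}{n} X^T w \Big\rangle + \gamma_n \frac{\|y - X\hat{\theta}\|_2}{\sqrt{n}} \big\{ \|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1 \big\}.$$

(d) Suppose that $\gamma_n \geq 2 \frac{\|X^T w\|_\infty}{\sqrt{n}\, \|w\|_2}$. Show that the error vector satisfies the cone constraint
$\|\hat{\Delta}_{S^c}\|_1 \leq 3 \|\hat{\Delta}_S\|_1$.

(e) Suppose in addition that X satisfies an RE condition over the set $\mathbb{C}_3(S)$. Show that there
is a universal constant c such that

$$\|\hat{\theta} - \theta^*\|_2 \leq c\, \frac{\|w\|_2}{\sqrt{n}}\, \gamma_n \sqrt{s}.$$
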
Exercise 7.18 (From pairwise incoherence to irrepresentable condition) Consider a matrix
$X \in \mathbb{R}^{n \times d}$ whose pairwise incoherence (7.12) satisfies the bound $\delta_{PW}(X) < \frac{1}{2s}$. Show that the
irrepresentable condition (7.43b) holds for any subset S of cardinality at most s.


Exercise 7.19 (Irrepresentable condition for random designs) Let $X \in \mathbb{R}^{n \times d}$ be a random
matrix with rows $\{x_i\}_{i=1}^n$ sampled i.i.d. according to a $N(0, \Sigma)$ distribution. Suppose that the
diagonal entries of $\Sigma$ are at most 1, and that it satisfies the irrepresentable condition with
parameter $\alpha \in [0, 1)$, that is,

$$\max_{j \in S^c} \|\Sigma_{jS} (\Sigma_{SS})^{-1}\|_1 \leq \alpha < 1.$$

Let $z \in \mathbb{R}^s$ be a random vector that depends only on the submatrix $X_S$.

(a) Show that, for each $j \in S^c$,

$$|X_j^T X_S (X_S^T X_S)^{-1} z| \leq \alpha + |W_j^T X_S (X_S^T X_S)^{-1} z|,$$

where $W_j \in \mathbb{R}^n$ is a Gaussian random vector, independent of $X_S$.

(b) Use part (a) and random matrix/vector tail bounds to show that

$$\max_{j \in S^c} |X_j^T X_S (X_S^T X_S)^{-1} z| \leq \alpha' := \tfrac{1}{2}(1 + \alpha),$$

with probability at least $1 - 4e^{-c \log d}$, as long as $n > \frac{16}{(1 - \alpha)\sqrt{c_{\min}}}\, s \log(d - s)$, where
$c_{\min} = \gamma_{\min}(\Sigma_{SS})$.


Exercise 7.20 (Analysis of $\ell_0$-regularization) Consider a design matrix $X \in \mathbb{R}^{n \times d}$ satisfying
the $\ell_0$-based upper/lower RE condition

$$\gamma_\ell \|\Delta\|_2^2 \leq \frac{\|X\Delta\|_2^2}{n} \leq \gamma_u \|\Delta\|_2^2 \qquad \text{for all } \|\Delta\|_0 \leq 2s. \tag{7.68}$$

Suppose that we observe noisy samples $y = X\theta^* + w$ for some s-sparse vector $\theta^*$, where the
noise vector has i.i.d. $N(0, \sigma^2)$ entries. In this exercise, we analyze an estimator based on
the $\ell_0$-constrained quadratic program

$$\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 \Big\} \qquad \text{such that } \|\theta\|_0 \leq s. \tag{7.69}$$

(a) Show that the non-convex program (7.69) has a unique optimal solution $\hat{\theta} \in \mathbb{R}^d$.

(b) Using the “basic inequality” proof technique, show that

$$\|\hat{\theta} - \theta^*\|_2^2 \lesssim \frac{\sigma^2 \gamma_u}{\gamma_\ell^2} \frac{s \log(ed/s)}{n}$$

with probability at least $1 - c_1 e^{-c_2 s \log(ed/s)}$. (Hint: The result of Exercise 5.7 could be
useful to you.)


8 


Principal component analysis in high dimensions 


Principal component analysis (PCA) is a standard technique for exploratory data analysis 
and dimension reduction. It is based on seeking the maximal variance components of a 
distribution, or equivalently, a low-dimensional subspace that captures the majority of the 
variance. Given a finite collection of samples, the empirical form of principal component 
analysis involves computing some subset of the top eigenvectors of the sample covariance 
matrix. Of interest is when these eigenvectors provide a good approximation to the subspace 
spanned by the top eigenvectors of the population covariance matrix. In this chapter, we 
study these issues in a high-dimensional and non-asymptotic framework, both for classical 
unstructured forms of PCA as well as for more modern structured variants. 


8.1 Principal components and dimension reduction 


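Let $\mathcal{S}_+^{d \times d}$ denote the space of d-dimensional positive semidefinite matrices, and denote the
Euclidean unit sphere in $\mathbb{R}^d$ by $\mathbb{S}^{d-1} = \{v \in \mathbb{R}^d \mid \|v\|_2 = 1\}$. Consider a d-dimensional random
vector X, say with a zero-mean vector and covariance matrix $\Sigma \in \mathcal{S}_+^{d \times d}$. We use

$$\gamma_1(\Sigma) \geq \gamma_2(\Sigma) \geq \cdots \geq \gamma_d(\Sigma) \geq 0$$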


to denote the ordered eigenvalues of the covariance matrix. In its simplest instantiation,
principal component analysis asks: along what unit-norm vector $v \in \mathbb{S}^{d-1}$ is the variance
of the random variable $\langle v, X \rangle$ maximized? This direction is known as the first principal
component at the population level, assumed here for the sake of discussion to be unique. In
analytical terms, we have

$$v^* = \arg\max_{v \in \mathbb{S}^{d-1}} \mathrm{var}(\langle v, X \rangle) = \arg\max_{v \in \mathbb{S}^{d-1}} \mathbb{E}\big[\langle v, X \rangle^2\big] = \arg\max_{v \in \mathbb{S}^{d-1}} \langle v, \Sigma v \rangle, \tag{8.1}$$

so that by definition, the first principal component is the maximum eigenvector of the
covariance matrix $\Sigma$. More generally, we can define the top r principal components at the
population level by seeking an orthonormal matrix $V \in \mathbb{R}^{d \times r}$, formed with unit-norm and
orthogonal columns $\{v_1, \ldots, v_r\}$, that maximizes the quantity

$$\mathbb{E}\|V^T X\|_2^2 = \sum_{j=1}^r \mathbb{E}\big[\langle v_j, X \rangle^2\big]. \tag{8.2}$$

As we explore in Exercise 8.4, these principal components are simply the top r eigenvectors
of the population covariance matrix $\Sigma$.
In practice, however, we do not know the covariance matrix, but rather only have access
to a finite collection of samples, say $\{x_i\}_{i=1}^n$, each drawn according to $\mathbb{P}$. Based on these
samples (and using the zero-mean assumption), we can form the sample covariance matrix
$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$. The empirical version of PCA is based on the “plug-in” principle, namely
replacing the unknown population covariance $\Sigma$ with this empirical version $\hat{\Sigma}$. For instance,
the empirical analog of the first principal component (8.1) is given by the optimization problem

$$\hat{v} = \arg\max_{v \in \mathbb{S}^{d-1}} \langle v, \hat{\Sigma} v \rangle. \tag{8.3}$$

Consequently, from the statistical point of view, we need to understand in what sense the
maximizers of these empirically defined problems provide good approximations to their
population analogs. Alternatively phrased, we need to determine how the eigenstructures of
the population and sample covariance matrices are related.


8.1.1 Interpretations and uses of PCA 


Before turning to the analysis of PCA, let us consider some of its interpretations and appli- 
cations. 


Example 8.1 (PCA as matrix approximation) Principal component analysis can be interpreted
in terms of low-rank approximation. In particular, given some unitarily invariant¹
matrix norm $|||\cdot|||$, consider the problem of finding the best rank-r approximation to a given
matrix $\Sigma$, that is,

$$Z^* = \arg\min_{\mathrm{rank}(Z) \leq r} \big\{ |||\Sigma - Z|||^2 \big\}. \tag{8.4}$$

In this interpretation, the matrix $\Sigma$ need only be symmetric, not necessarily positive semidefinite
as it must be when it is a covariance matrix. A classical result known as the
Eckart–Young–Mirsky theorem guarantees that an optimal solution $Z^*$ exists, and takes the form of
a truncated eigendecomposition, specified in terms of the top r eigenvectors of the matrix $\Sigma$.
More precisely, recall that the symmetric matrix $\Sigma$ has an orthonormal basis of eigenvectors,
say $\{v_1, \ldots, v_d\}$, associated with its ordered eigenvalues $\{\gamma_j(\Sigma)\}_{j=1}^d$. In terms of this notation,
the optimal rank-r approximation takes the form

$$Z^* = \sum_{j=1}^r \gamma_j(\Sigma)\, (v_j \otimes v_j), \tag{8.5}$$

where $v_j \otimes v_j := v_j v_j^T$ is the rank-one outer product. For the Frobenius matrix norm
$|||M|||_F = \sqrt{\sum_{j,k=1}^d M_{jk}^2}$, the error in the optimal approximation is given by

$$|||Z^* - \Sigma|||_F^2 = \sum_{j=r+1}^d \gamma_j^2(\Sigma). \tag{8.6}$$

¹ For a symmetric matrix M, a matrix norm is unitarily invariant if $|||M||| = |||V^T M V|||$ for any orthonormal
matrix V. See Exercise 8.2 for further discussion.

Figure 8.1 Illustration of PCA for low-rank matrix approximation. (a) Eigenspectrum
of a matrix $\Sigma \in \mathcal{S}_+^{100 \times 100}$ generated as described in the text. Note the extremely
rapid decay of the sorted eigenspectrum. Dark diamonds mark the rank cutoffs
$r \in \{5, 10, 25, 100\}$, the first three of which define three approximations to the whole
matrix (r = 100). (b) Top left: original matrix. Top right: approximation based on
r = 5 components. Bottom left: approximation based on r = 10 components. Bottom
right: approximation based on r = 25 components.

Figure 8.1 provides an illustration of the matrix approximation view of PCA. We first
generated the Toeplitz matrix $T \in \mathcal{S}_+^{d \times d}$ with entries $T_{jk} = e^{-\alpha \sqrt{|j - k|}}$ with $\alpha = 0.95$, and then
formed the recentered matrix $\Sigma := T - \gamma_{\min}(T) I_d$. Figure 8.1(a) shows the eigenspectrum
of the matrix $\Sigma$: note the rapid decay of the eigenvalues, which renders it amenable to an
accurate low-rank approximation. The top left image in Figure 8.1(b) corresponds to the
original matrix $\Sigma$, whereas the remaining images illustrate approximations with increasing
rank (r = 5 in top right, r = 10 in bottom left and r = 25 in bottom right). Although the
defects in approximations with rank r = 5 or r = 10 are readily apparent, the approximation
with rank r = 25 seems reasonable. ♣
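
The error formula (8.6) for the truncated eigendecomposition (8.5) can be checked on a small matrix whose eigenpairs are known by construction. The sketch below is an illustration only: the spectrum, the Householder reflection used to produce an orthonormal eigenbasis, and all function names are our own assumptions, and it does not reproduce the Toeplitz matrix from the text. It verifies the error formula, not the optimality claim of the Eckart–Young–Mirsky theorem.

```python
def householder_basis(u):
    """Rows of the reflection I - 2uu^T/<u,u>, which form an orthonormal basis."""
    d = len(u)
    norm_sq = sum(x * x for x in u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / norm_sq
             for j in range(d)] for i in range(d)]

def from_eigenpairs(eigvals, eigvecs):
    """Assemble sum_j eigvals[j] * v_j v_j^T for the listed eigenpairs."""
    d = len(eigvecs[0])
    M = [[0.0] * d for _ in range(d)]
    for gam, v in zip(eigvals, eigvecs):
        for i in range(d):
            for k in range(d):
                M[i][k] += gam * v[i] * v[k]
    return M

def frobenius_sq(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Assumed toy spectrum (already ordered) and an arbitrary reflection direction.
gammas = [5.0, 3.0, 1.5, 0.5, 0.1]
V = householder_basis([1.0, 2.0, -1.0, 0.5, 3.0])   # rows play the role of v_1, ..., v_d
Sigma = from_eigenpairs(gammas, V)

r = 2
Z_star = from_eigenpairs(gammas[:r], V[:r])          # truncated eigendecomposition (8.5)
error = frobenius_sq(Z_star, Sigma)                  # left-hand side of (8.6)
tail = sum(g ** 2 for g in gammas[r:])               # right-hand side of (8.6)

if __name__ == "__main__":
    print(error, tail)
```

Because the eigenvectors are orthonormal, the residual $\Sigma - Z^*$ is itself a sum of rank-one terms with coefficients $\gamma_{r+1}, \ldots, \gamma_d$, which is why the two printed numbers agree up to rounding.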


Example 8.2 (PCA for data compression) Principal component analysis can also be interpreted
as a linear form of data compression. Given a zero-mean random vector $X \in \mathbb{R}^d$, a
simple way in which to compress it is via projection to a lower-dimensional subspace $\mathbb{V}$,
say via a projection operator of the form $\Pi_{\mathbb{V}}(X)$. For a fixed dimension r, how do we choose
the subspace $\mathbb{V}$? Consider the criterion that chooses $\mathbb{V}$ by minimizing the mean-squared error

$$\mathbb{E}\big[\|X - \Pi_{\mathbb{V}}(X)\|_2^2\big].$$

This optimal subspace need not be unique in general, but will be when there is a gap between
the eigenvalues $\gamma_r(\Sigma)$ and $\gamma_{r+1}(\Sigma)$. In this case, the optimal subspace $\mathbb{V}^*$ is spanned by the top
r eigenvectors of the covariance matrix $\Sigma = \mathrm{cov}(X)$. In particular, the projection operator
$\Pi_{\mathbb{V}^*}$ can be written as $\Pi_{\mathbb{V}^*}(x) = V_r V_r^T x$, where $V_r \in \mathbb{R}^{d \times r}$ is an orthonormal matrix with
the top r eigenvectors $\{v_1, \ldots, v_r\}$ as its columns. Using this optimal projection, the minimal
reconstruction error based on a rank-r projection is given by

$$\mathbb{E}\big[\|X - \Pi_{\mathbb{V}^*}(X)\|_2^2\big] = \sum_{j=r+1}^d \gamma_j(\Sigma), \tag{8.7}$$

where $\{\gamma_j(\Sigma)\}_{j=1}^d$ are the ordered eigenvalues of $\Sigma$. See Exercise 8.4 for further exploration
of these and other properties.

Figure 8.2 (a) Samples of face images from the Yale Face Database. (b) First 100
eigenvalues of the sample covariance matrix. (c) First 25 eigenfaces computed from
the sample covariance matrix. (d) Reconstructions based on the first 25 eigenfaces
plus the average face.

The problem of face analysis provides an interesting illustration of PCA for data compression.
Consider a large database of face images, such as those illustrated in Figure 8.2(a).
Taken from the Yale Face Database, each image is gray-scale with dimensions 243 × 320.
By vectorizing each image, we obtain a vector x in d = 243 × 320 = 77 760 dimensions.
We compute the average image $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and the sample covariance matrix
$\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ based on n = 165 samples. Figure 8.2(b) shows the relatively
fast decay of the first 100 eigenvalues of this sample covariance matrix. Figure 8.2(c)
shows the average face (top left image) along with the first 24 “eigenfaces”, meaning the
top 24 eigenvectors of the sample covariance matrix, each converted back to a 243 × 320
image. Finally, for a particular sample, Figure 8.2(d) shows a sequence of reconstructions
of a given face, starting with the average face (top left image), and followed by the average
face in conjunction with principal components 1 through 24. ♣


Principal component analysis can also be used for estimation in mixture models. 


Example 8.3 (PCA for Gaussian mixture models) Let $\phi(\cdot\,; \mu, \Sigma)$ denote the density of a
Gaussian random vector with mean vector $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathcal{S}_+^{d \times d}$. A
two-component Gaussian mixture model with isotropic covariance structure is a random vector
$X \in \mathbb{R}^d$ drawn according to the density

$$f(x; \theta) = \alpha\, \phi(x; -\theta^*, \sigma^2 I_d) + (1 - \alpha)\, \phi(x; \theta^*, \sigma^2 I_d), \tag{8.8}$$

where $\theta^* \in \mathbb{R}^d$ is a vector parameterizing the means of the two Gaussian components,
$\alpha \in (0, 1)$ is a mixture weight and $\sigma > 0$ is a dispersion term. Figure 8.3 provides an illustration
of such a mixture model in d = 2 dimensions, with mean vector $\theta^* = [0.6 \;\; -0.6]^T$,
standard deviation $\sigma = 0.4$ and weight $\alpha = 0.4$. Given samples $\{x_i\}_{i=1}^n$ drawn from such a
model, a natural goal is to estimate the mean vector $\theta^*$. Principal component analysis provides
a natural method for doing so. In particular, a straightforward calculation yields that
the second-moment matrix

$$\Gamma := \mathbb{E}[X \otimes X] = \theta^* \otimes \theta^* + \sigma^2 I_d,$$


where $X \otimes X := X X^T$ is the d × d rank-one outer product matrix. Thus, we see that $\theta^*$
is proportional to the maximal eigenvector of $\Gamma$. Consequently, a reasonable estimator $\hat{\theta}$ is
given by the maximal eigenvector of the sample second-moment² matrix $\hat{\Gamma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$.
We study the properties of this estimator in Exercise 8.6. ♣

Figure 8.3 Use of PCA for Gaussian mixture models. (a) Density function of a
two-component Gaussian mixture (8.8) with mean vector $\theta^* = [0.6 \;\; -0.6]^T$, standard
deviation $\sigma = 0.4$ and weight $\alpha = 0.4$. (b) Contour plots of the density function,
which provide intuition as to why PCA should be useful in recovering the mean
vector $\theta^*$. (Axes in both panels: first dimension, second dimension.)
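
The estimator of Example 8.3 can be simulated directly in d = 2, where the top eigenpair of a symmetric matrix has a closed form. The sketch below is a minimal illustration under stated assumptions: it uses the parameter values quoted in the example, a sample size and random seed chosen arbitrarily, and helper names of our own; the final rescaling step additionally assumes that $\sigma$ is known, which the example does not require.

```python
import math
import random

def sample_mixture(n, theta, sigma, alpha, rng):
    """Draw n points: mean -theta with probability alpha, +theta otherwise."""
    samples = []
    for _ in range(n):
        sign = -1.0 if rng.random() < alpha else 1.0
        samples.append([sign * t + sigma * rng.gauss(0.0, 1.0) for t in theta])
    return samples

def second_moment_2d(samples):
    """Entries (a, b, c) of the 2x2 sample second-moment matrix [[a, b], [b, c]]."""
    n = len(samples)
    a = sum(x[0] * x[0] for x in samples) / n
    b = sum(x[0] * x[1] for x in samples) / n
    c = sum(x[1] * x[1] for x in samples) / n
    return a, b, c

def top_eigenpair_2d(a, b, c):
    """Largest eigenvalue and a unit eigenvector of [[a, b], [b, c]].

    Assumes the matrix is not a multiple of the identity (otherwise the
    maximal eigenvector is not unique and the normalization divides by zero).
    """
    lam = 0.5 * ((a + c) + math.sqrt((a - c) ** 2 + 4.0 * b * b))
    # (lam - c, b) and (b, lam - a) both solve the eigen-equation; use the longer one.
    v = (lam - c, b) if abs(lam - c) >= abs(lam - a) else (b, lam - a)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

rng = random.Random(0)                      # arbitrary seed, for reproducibility
theta_star = [0.6, -0.6]
sigma, alpha = 0.4, 0.4
samples = sample_mixture(5000, theta_star, sigma, alpha, rng)
lam_hat, v_hat = top_eigenpair_2d(*second_moment_2d(samples))

# Alignment between the estimated direction and theta_star (sign is not identifiable).
norm_theta = math.hypot(*theta_star)
cosine = abs(v_hat[0] * theta_star[0] + v_hat[1] * theta_star[1]) / norm_theta

# Optional rescaling, assuming sigma is known: lam_hat - sigma^2 estimates ||theta_star||^2.
scale = math.sqrt(max(lam_hat - sigma ** 2, 0.0))
theta_hat = [scale * v_hat[0], scale * v_hat[1]]

if __name__ == "__main__":
    print(cosine, lam_hat, theta_hat)
```

At the population level the top eigenvalue of $\Gamma$ equals $\|\theta^*\|_2^2 + \sigma^2 = 0.88$ for these parameter values, so the printed eigenvalue should be close to that number and the cosine close to one.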


8.1.2 Perturbations of eigenvalues and eigenspaces 


Thus far, we have seen that the eigenvectors of population and sample covariance matri- 
ces are interesting objects with a range of uses. In practice, PCA is always applied to the 
sample covariance matrix, and the central question of interest is how well the sample-based 
eigenvectors approximate those of the population covariance. 

Before addressing this question, let us make a brief detour into matrix perturbation theory. 
Let us consider the following general question: given a symmetric matrix R, how does its 
eigenstructure relate to the perturbed matrix Q = R+P? Here P is another symmetric matrix, 
playing the role of the perturbation. It turns out that the eigenvalues of Q and R are related 
in a straightforward manner. Understanding how the eigenspaces change, however, requires 
some more care. 

Let us begin with changes in the eigenvalues. From the standard variational definition of 
the maximum eigenvalue, we have 


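$$\gamma_1(Q) = \max_{v \in \mathbb{S}^{d-1}} \langle v, (R + P)v \rangle \le \max_{v \in \mathbb{S}^{d-1}} \langle v, Rv \rangle + \max_{v \in \mathbb{S}^{d-1}} \langle v, Pv \rangle \le \gamma_1(R) + |||P|||_2.$$

Since the same argument holds with the roles of $Q$ and $R$ reversed, we conclude that $|\gamma_1(Q) - \gamma_1(R)| \le |||Q - R|||_2$. Thus, the maximum eigenvalues of $Q$ and $R$ can differ by at most the operator norm of their difference. More generally, we have

$$\max_{j=1,\ldots,d} |\gamma_j(Q) - \gamma_j(R)| \le |||Q - R|||_2. \tag{8.9}$$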


This bound is a consequence of a more general result known as Weyl’s inequality; we work 
through its proof in Exercise 8.3. 
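The eigenvalue bound (8.9) is easy to check numerically. The sketch below compares the sorted eigenvalues of a random symmetric matrix and of a perturbed copy; the dimension, the random seed and the scale of the perturbation are arbitrary choices made for illustration.

```python
import numpy as np


def sorted_eigenvalues(M):
    """Eigenvalues of a symmetric matrix, ordered from largest to smallest."""
    return np.linalg.eigvalsh(M)[::-1]


rng = np.random.default_rng(1)
d = 6

A = rng.standard_normal((d, d))
R = (A + A.T) / 2            # symmetric base matrix
B = rng.standard_normal((d, d))
P = 0.1 * (B + B.T) / 2      # symmetric perturbation
Q = R + P

# Largest shift between matched (sorted) eigenvalues of Q and R.
max_shift = np.max(np.abs(sorted_eigenvalues(Q) - sorted_eigenvalues(R)))
# Operator (spectral) norm of the perturbation.
op_norm = np.linalg.norm(P, 2)

print(f"max_j |gamma_j(Q) - gamma_j(R)| = {max_shift:.4f}")
print(f"|||P|||_2                       = {op_norm:.4f}")
```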

Although eigenvalues are generically stable, the same does not hold for eigenvectors and 
eigenspaces, unless further conditions are imposed. The following example provides an il- 
lustration of such instability: 


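Example 8.4 (Sensitivity of eigenvectors) For a parameter $\epsilon \in [0, 1]$, consider the family of symmetric matrices

$$Q_\epsilon := \begin{bmatrix} 1 & \epsilon \\ \epsilon & 1.01 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 \\ 0 & 1.01 \end{bmatrix}}_{Q_0} + \epsilon \underbrace{\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}}_{P}. \tag{8.10}$$

By construction, the matrix $Q_\epsilon$ is a perturbation of a diagonal matrix $Q_0$ by an $\epsilon$-multiple of the fixed matrix $P$. Since $|||P|||_2 = 1$, the magnitude of the perturbation is directly controlled by $\epsilon$. On one hand, the eigenvalues remain stable to this perturbation: in terms of the shorthand $a = 1.01$, we have $\gamma(Q_0) = \{1, a\}$ and

$$\gamma(Q_\epsilon) = \Big\{ \tfrac{1}{2}\big[(a+1) + \sqrt{(a-1)^2 + 4\epsilon^2}\,\big], \; \tfrac{1}{2}\big[(a+1) - \sqrt{(a-1)^2 + 4\epsilon^2}\,\big] \Big\}.$$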

Footnote 2: This second-moment matrix coincides with the usual covariance matrix for the special case of an equally weighted mixture pair with $\alpha = 0.5$.
Thus, we find that

$$\max_{j=1,2} |\gamma_j(Q_0) - \gamma_j(Q_\epsilon)| = \tfrac{1}{2}\Big| (a-1) - \sqrt{(a-1)^2 + 4\epsilon^2} \Big| \le \epsilon,$$

which confirms the validity of Weyl's inequality (8.9) in this particular case.

On the other hand, the maximal eigenvector of $Q_\epsilon$ is very different from that of $Q_0$, even for relatively small values of $\epsilon$. For $\epsilon = 0$, the matrix $Q_0$ has the unique maximal eigenvector $v_0 = [0 \;\; 1]^T$. However, if we set $\epsilon = 0.01$, a numerical calculation shows that the maximal eigenvector of $Q_\epsilon$ is $v_\epsilon \approx [0.53 \;\; 0.85]^T$. Note that $\|v_0 - v_\epsilon\|_2 \gg \epsilon$, showing that eigenvectors can be extremely sensitive to perturbations. ♣


What is the underlying problem? The issue is that, while $Q_0$ has a unique maximal eigenvector, the gap between the largest eigenvalue $\gamma_1(Q_0) = 1.01$ and the second largest eigenvalue $\gamma_2(Q_0) = 1$ is very small. Consequently, even small perturbations of the matrix lead to "mixing" between the spaces spanned by the top and second largest eigenvectors. On the other hand, if this eigengap can be bounded away from zero, then it turns out that we can guarantee stability of the eigenvectors. We now turn to this type of theory.
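The numbers quoted in Example 8.4 can be reproduced with a few lines of NumPy. This is only a sketch: the helper names `q_eps` and `top_pair` are illustrative, and the sign of each eigenvector is fixed by requiring a non-negative second coordinate so that the two vectors are comparable.

```python
import numpy as np


def q_eps(eps, a=1.01):
    """The 2 x 2 matrix Q_eps = Q_0 + eps * P from equation (8.10)."""
    return np.array([[1.0, eps], [eps, a]])


def top_pair(M):
    """Largest eigenvalue and its eigenvector, sign-normalised."""
    eigvals, eigvecs = np.linalg.eigh(M)  # ascending order
    v = eigvecs[:, -1]
    if v[1] < 0:                          # resolve the sign ambiguity
        v = -v
    return eigvals[-1], v


eps = 0.01
gamma0, v0 = top_pair(q_eps(0.0))
gamma_eps, v_eps = top_pair(q_eps(eps))

eig_shift = abs(gamma_eps - gamma0)
vec_shift = np.linalg.norm(v_eps - v0)

print(f"eigenvalue shift  = {eig_shift:.5f} (at most eps = {eps})")
print(f"v_eps             = {np.round(v_eps, 2)}")
print(f"eigenvector shift = {vec_shift:.3f}")
```

The eigenvalue moves by less than $\epsilon$, in line with (8.9), while the eigenvector moves by a distance that is many multiples of $\epsilon$.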


8.2 Bounds for generic eigenvectors 


We begin our exploration of eigenvector bounds with the generic case, in which no additional 
structure is imposed on the eigenvectors. In later sections, we turn to structured variants of 
eigenvector estimation. 


8.2.1 A general deterministic result 


Consider a symmetric positive semidefinite matrix $\Sigma$ with eigenvalues ordered as

$$\gamma_1(\Sigma) \ge \gamma_2(\Sigma) \ge \gamma_3(\Sigma) \ge \cdots \ge \gamma_d(\Sigma) \ge 0.$$

Let $\theta^* \in \mathbb{R}^d$ denote its maximal eigenvector, assumed to be unique. Now consider a perturbed version $\hat\Sigma := \Sigma + P$ of the original matrix. As suggested by our notation, in the context of PCA, the original matrix corresponds to the population covariance matrix, whereas the perturbed matrix corresponds to the sample covariance. However, at least for the time being, our theory should be viewed as general.

As should be expected based on Example 8.4, any theory relating the maximum eigenvectors of $\Sigma$ and $\hat\Sigma$ should involve the eigengap $\nu := \gamma_1(\Sigma) - \gamma_2(\Sigma)$, assumed to be strictly positive. In addition, the following result involves the transformed perturbation matrix

$$\tilde P := U^T P U = \begin{bmatrix} \tilde p_{11} & \tilde p^T \\ \tilde p & \tilde P_{22} \end{bmatrix}, \tag{8.11}$$

where $\tilde p_{11} \in \mathbb{R}$, $\tilde p \in \mathbb{R}^{d-1}$ and $\tilde P_{22} \in \mathbb{R}^{(d-1) \times (d-1)}$. Here $U$ is an orthonormal matrix with the eigenvectors of $\Sigma$ as its columns.
Theorem 8.5 Consider a positive semidefinite matrix $\Sigma$ with maximum eigenvector $\theta^* \in \mathbb{S}^{d-1}$ and eigengap $\nu = \gamma_1(\Sigma) - \gamma_2(\Sigma) > 0$. Given any matrix $P \in \mathcal{S}^{d \times d}$ such that $|||P|||_2 < \nu/2$, the perturbed matrix $\hat\Sigma := \Sigma + P$ has a unique maximal eigenvector $\hat\theta$ satisfying the bound

$$\|\hat\theta - \theta^*\|_2 \le \frac{2\|\tilde p\|_2}{\nu - 2|||P|||_2}. \tag{8.12}$$

In general, this bound is sharp in the sense that there are problems for which the requirement $|||P|||_2 < \nu/2$ cannot be loosened. As an example, suppose that $\Sigma = \mathrm{diag}\{2, 1\}$ so that $\nu = 2 - 1 = 1$. Given $P = \mathrm{diag}\{-\tfrac{1}{2}, +\tfrac{1}{2}\}$, the perturbed matrix $\hat\Sigma = \Sigma + P = \tfrac{3}{2} I_2$ no longer has a unique maximal eigenvector. Note that this counterexample lies just at the boundary of our requirement, since $|||P|||_2 = \tfrac{1}{2} = \tfrac{\nu}{2}$.
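To see the bound (8.12) in action, the following sketch builds a matrix with a known spectrum, applies a symmetric perturbation with $|||P|||_2 < \nu/2$, and compares the realised eigenvector error with the right-hand side of (8.12). The spectrum, the perturbation scale and the function name `perturbation_bound` are illustrative assumptions.

```python
import numpy as np


def perturbation_bound(Sigma, P):
    """Return (error, bound) for the eigenvector bound (8.12).

    error = ||theta_hat - theta*||_2 after aligning signs,
    bound = 2 ||p_tilde||_2 / (nu - 2 |||P|||_2).
    """
    gammas, U = np.linalg.eigh(Sigma)       # ascending order
    gammas, U = gammas[::-1], U[:, ::-1]    # reorder: largest eigenvalue first
    nu = gammas[0] - gammas[1]
    op_norm = np.linalg.norm(P, 2)
    if not op_norm < nu / 2:
        raise ValueError("the condition |||P|||_2 < nu/2 is violated")

    theta_star = U[:, 0]
    p_tilde = (U.T @ P @ U)[1:, 0]          # off-diagonal block of (8.11)

    theta_hat = np.linalg.eigh(Sigma + P)[1][:, -1]
    if theta_hat @ theta_star < 0:          # resolve the sign ambiguity
        theta_hat = -theta_hat

    error = np.linalg.norm(theta_hat - theta_star)
    bound = 2 * np.linalg.norm(p_tilde) / (nu - 2 * op_norm)
    return error, bound


rng = np.random.default_rng(4)
d = 5
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis
Sigma = basis @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5]) @ basis.T
Sigma = (Sigma + Sigma.T) / 2                          # remove rounding asymmetry

B = rng.standard_normal((d, d))
P = (B + B.T) / 2
P *= 0.2 / np.linalg.norm(P, 2)                        # |||P|||_2 = 0.2 < nu/2 = 0.5

error, bound = perturbation_bound(Sigma, P)
print(f"error = {error:.4f}, bound = {bound:.4f}")
```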


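Proof Our proof is variational in nature, based on the optimization problems that characterize the maximal eigenvectors of the matrices $\Sigma$ and $\hat\Sigma$, respectively. Define the error vector $\hat\Delta = \hat\theta - \theta^*$, and the function

$$\Psi(\Delta; P) := \langle \Delta, P\Delta \rangle + 2\langle \Delta, P\theta^* \rangle. \tag{8.13}$$

In parallel to our analysis of sparse linear regression from Chapter 7, the first step in our analysis is to prove the basic inequality for PCA. For future reference, we state this inequality in a slightly more general form than required for the current proof. In particular, given any subset $C \subseteq \mathbb{S}^{d-1}$, let $\theta^*$ and $\hat\theta$ maximize the quadratic objectives

$$\max_{\theta \in C} \langle \theta, \Sigma\theta \rangle \quad \text{and} \quad \max_{\theta \in C} \langle \theta, \hat\Sigma\theta \rangle, \tag{8.14}$$

respectively. The current proof involves the choice $C = \mathbb{S}^{d-1}$.

It is convenient to bound the distance between $\hat\theta$ and $\theta^*$ in terms of the inner product $\varrho := \langle \hat\theta, \theta^* \rangle$. Due to the sign ambiguity in eigenvector estimation, we may assume without loss of generality that $\hat\theta$ is chosen such that $\varrho \in [0, 1]$.


Lemma 8.6 (PCA basic inequality) Given a matrix $\Sigma$ with eigengap $\nu > 0$, the error $\hat\Delta = \hat\theta - \theta^*$ is bounded as

$$\nu\big(1 - \langle \hat\theta, \theta^* \rangle^2\big) \le |\Psi(\hat\Delta; P)|. \tag{8.15}$$


Taking this inequality as given for the moment, the remainder of the proof is straightforward. Recall the transformation $\tilde P = U^T P U$, or equivalently $P = U \tilde P U^T$. Substituting this expression into equation (8.13) yields

$$\Psi(\hat\Delta; P) = \langle U^T\hat\Delta, \tilde P U^T\hat\Delta \rangle + 2\langle U^T\hat\Delta, \tilde P U^T\theta^* \rangle. \tag{8.16}$$

In terms of the inner product $\varrho = \langle \hat\theta, \theta^* \rangle$, we may write $\hat\theta = \varrho\,\theta^* + \sqrt{1 - \varrho^2}\, z$, where $z \in \mathbb{R}^d$ is a vector orthogonal to $\theta^*$. Since the matrix $U$ is orthonormal with its first column given by $\theta^*$, we have $U^T\theta^* = e_1$. Letting $U_2 \in \mathbb{R}^{d \times (d-1)}$ denote the submatrix formed by the remaining $d - 1$ eigenvectors and defining the vector $\tilde z = U_2^T z \in \mathbb{R}^{d-1}$, we can write

$$U^T\hat\Delta = \begin{bmatrix} (\varrho - 1) & (1 - \varrho^2)^{1/2}\, \tilde z \end{bmatrix}^T.$$

Substituting these relations into equation (8.16) yields that

$$\begin{aligned}
\Psi(\hat\Delta; P) &= (\varrho - 1)^2 \tilde p_{11} + 2(\varrho - 1)\sqrt{1 - \varrho^2}\, \langle \tilde z, \tilde p \rangle + (1 - \varrho^2) \langle \tilde z, \tilde P_{22}\tilde z \rangle \\
&\qquad + 2(\varrho - 1)\tilde p_{11} + 2\sqrt{1 - \varrho^2}\, \langle \tilde z, \tilde p \rangle \\
&= (\varrho^2 - 1)\tilde p_{11} + 2\varrho\sqrt{1 - \varrho^2}\, \langle \tilde z, \tilde p \rangle + (1 - \varrho^2) \langle \tilde z, \tilde P_{22}\tilde z \rangle.
\end{aligned}$$

Putting together the pieces, since $\|\tilde z\|_2 \le 1$ and $|\tilde p_{11}| \le |||\tilde P|||_2 = |||P|||_2$, we have

$$|\Psi(\hat\Delta; P)| \le 2(1 - \varrho^2)\,|||P|||_2 + 2\varrho\sqrt{1 - \varrho^2}\, \|\tilde p\|_2.$$

Combined with the basic inequality (8.15), we find that

$$\nu(1 - \varrho^2) \le 2(1 - \varrho^2)\,|||P|||_2 + 2\varrho\sqrt{1 - \varrho^2}\, \|\tilde p\|_2.$$

Whenever $\nu > 2|||P|||_2$, this inequality implies that $\sqrt{1 - \varrho^2} \le \frac{2\varrho\|\tilde p\|_2}{\nu - 2|||P|||_2}$. Noting that $\|\hat\Delta\|_2 = \sqrt{2(1 - \varrho)}$, we thus conclude that

$$\|\hat\Delta\|_2 \le \frac{\sqrt{2}\,\varrho}{\sqrt{1 + \varrho}} \left( \frac{2\|\tilde p\|_2}{\nu - 2|||P|||_2} \right) \le \frac{2\|\tilde p\|_2}{\nu - 2|||P|||_2},$$

where the final step follows since $2\varrho^2 \le 1 + \varrho$ for all $\varrho \in [0, 1]$.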
Let us now return to prove the PCA basic inequality (8.15). 


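Proof of Lemma 8.6: Since $\hat\theta$ and $\theta^*$ are optimal and feasible, respectively, for the programs (8.14), we are guaranteed that $\langle \theta^*, \hat\Sigma\theta^* \rangle \le \langle \hat\theta, \hat\Sigma\hat\theta \rangle$. Defining the matrix perturbation $P = \hat\Sigma - \Sigma$, we have

$$\langle\langle \Sigma, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle \le -\langle\langle P, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle,$$

where $\langle\langle A, B \rangle\rangle$ is the trace inner product, and $a \otimes a = aa^T$ denotes the rank-one outer product. Substituting $\hat\theta = \theta^* + \hat\Delta$ and performing some simple algebra, the right-hand side is seen to be equal to $\Psi(\hat\Delta; P)$. The final step is to show that

$$\langle\langle \Sigma, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle \ge \frac{\nu}{2}\|\hat\Delta\|_2^2. \tag{8.17}$$

Recall the representation $\hat\theta = \varrho\,\theta^* + \sqrt{1 - \varrho^2}\, z$, where the vector $z \in \mathbb{R}^d$ is orthogonal to $\theta^*$, and $\varrho \in [0, 1]$. Using the shorthand notation $\gamma_j = \gamma_j(\Sigma)$ for $j = 1, 2$, define the matrix $\Gamma = \Sigma - \gamma_1(\theta^* \otimes \theta^*)$, and note that $\Gamma\theta^* = 0$ and $|||\Gamma|||_2 \le \gamma_2$ by construction. Consequently, we can write

$$\begin{aligned}
\langle\langle \Sigma, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle &= \gamma_1 \langle\langle \theta^* \otimes \theta^*, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle + \langle\langle \Gamma, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle \\
&= (1 - \varrho^2)\big\{ \gamma_1 - \langle\langle \Gamma, z \otimes z \rangle\rangle \big\}.
\end{aligned}$$

Since $|||\Gamma|||_2 \le \gamma_2$, we have $|\langle\langle \Gamma, z \otimes z \rangle\rangle| \le \gamma_2$. Putting together the pieces, we have shown that

$$\langle\langle \Sigma, \; \theta^* \otimes \theta^* - \hat\theta \otimes \hat\theta \rangle\rangle \ge (1 - \varrho^2)\{\gamma_1 - \gamma_2\} = (1 - \varrho^2)\,\nu,$$

from which the claim (8.15) follows.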
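The basic inequality (8.15) can also be checked directly on a small instance. The sketch below evaluates $\Psi(\hat\Delta; P)$ from its definition (8.13) and compares $|\Psi(\hat\Delta; P)|$ with $\nu(1 - \varrho^2)$; the spectrum of $\Sigma$, the scale of $P$ and the helper names are assumptions chosen for illustration.

```python
import numpy as np


def psi(delta, P, theta_star):
    """Psi(Delta; P) = <Delta, P Delta> + 2 <Delta, P theta*>, as in (8.13)."""
    return delta @ P @ delta + 2 * (delta @ P @ theta_star)


def top_eigvec(M):
    return np.linalg.eigh(M)[1][:, -1]


rng = np.random.default_rng(5)
d = 6
spectrum = np.array([4.0, 2.5, 2.0, 1.0, 1.0, 0.5])
nu = spectrum[0] - spectrum[1]                       # eigengap = 1.5

basis, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = basis @ np.diag(spectrum) @ basis.T
Sigma = (Sigma + Sigma.T) / 2

B = rng.standard_normal((d, d))
P = (B + B.T) / 2
P *= 0.3 / np.linalg.norm(P, 2)                      # operator norm 0.3

theta_star = top_eigvec(Sigma)
theta_hat = top_eigvec(Sigma + P)
if theta_hat @ theta_star < 0:                       # choose the sign with rho >= 0
    theta_hat = -theta_hat

rho = theta_hat @ theta_star
lhs = nu * (1 - rho**2)
rhs = abs(psi(theta_hat - theta_star, P, theta_star))
print(f"nu (1 - rho^2) = {lhs:.5f} <= |Psi| = {rhs:.5f}")
```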


8.2.2 Consequences for a spiked ensemble 


Theorem 8.5 applies to any form of matrix perturbation. In the context of principal component analysis, this perturbation takes a very specific form, namely the difference between the sample and population covariance matrices. More concretely, suppose that we have drawn $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ from a zero-mean random vector with covariance $\Sigma$. Principal component analysis is then based on the eigenstructure of the sample covariance matrix $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$, and the goal is to draw conclusions about the eigenstructure of the population matrix.

In order to bring sharper focus to this issue, let us study how PCA behaves for a very simple class of covariance matrices, known as spiked covariance matrices. A sample $x_i \in \mathbb{R}^d$ from the spiked covariance ensemble takes the form

$$x_i \overset{d}{=} \sqrt{\nu}\, \xi_i\, \theta^* + w_i, \tag{8.18}$$

where $\xi_i \in \mathbb{R}$ is a zero-mean random variable with unit variance, and $w_i \in \mathbb{R}^d$ is a random vector independent of $\xi_i$, with zero mean and covariance matrix $I_d$. Overall, the random vector $x_i$ has zero mean, and a covariance matrix of the form

$$\Sigma := \nu\, \theta^* (\theta^*)^T + I_d. \tag{8.19}$$

By construction, for any $\nu > 0$, the vector $\theta^*$ is the unique maximal eigenvector of $\Sigma$ with eigenvalue $\gamma_1(\Sigma) = \nu + 1$. All other eigenvalues of $\Sigma$ are located at 1, so that we have an eigengap $\gamma_1(\Sigma) - \gamma_2(\Sigma) = \nu$.
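The structure of the spiked ensemble can be made concrete with a short sampling sketch. It assumes standard Gaussian $\xi_i$ and $w_i$, which is one admissible choice of zero-mean, unit-variance (and sub-Gaussian) variables rather than anything prescribed by the model; the function names are illustrative.

```python
import numpy as np


def sample_spiked(n, theta_star, nu, rng):
    """Draw n samples x_i = sqrt(nu) * xi_i * theta* + w_i, as in (8.18).

    Assumption: xi_i ~ N(0, 1) and w_i ~ N(0, I_d), drawn independently.
    """
    xi = rng.standard_normal(n)
    W = rng.standard_normal((n, theta_star.size))
    return np.sqrt(nu) * xi[:, None] * theta_star[None, :] + W


def spiked_covariance(theta_star, nu):
    """Population covariance nu * theta* theta*^T + I_d, as in (8.19)."""
    return nu * np.outer(theta_star, theta_star) + np.eye(theta_star.size)


rng = np.random.default_rng(6)
d, nu = 5, 2.0
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

Sigma = spiked_covariance(theta_star, nu)
eigvals = np.linalg.eigvalsh(Sigma)[::-1]            # largest first
print("population eigenvalues:", np.round(eigvals, 6))

X = sample_spiked(200_000, theta_star, nu, rng)
Sigma_hat = X.T @ X / X.shape[0]
max_entry_gap = np.max(np.abs(Sigma_hat - Sigma))
print(f"max entrywise gap between sample and population covariance: {max_entry_gap:.4f}")
```

The population spectrum consists of a single eigenvalue at $\nu + 1$ and $d - 1$ eigenvalues at 1, so the eigengap equals $\nu$.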

In the following result, we say that the vector $x_i \in \mathbb{R}^d$ has sub-Gaussian tails if both $\xi_i$ and $w_i$ are sub-Gaussian with parameter at most one.


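Corollary 8.7 Given i.i.d. samples $\{x_i\}_{i=1}^n$ from the spiked covariance ensemble (8.18) with sub-Gaussian tails, suppose that $n > d$ and $\sqrt{\frac{\nu + 1}{\nu^2}}\sqrt{\frac{d}{n}} \le \frac{1}{128}$. Then, with probability at least $1 - c_1 e^{-c_2 n \min\{\sqrt{\nu}\,\delta, \; \nu\delta^2\}}$, there is a unique maximal eigenvector $\hat\theta$ of the sample covariance matrix $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ such that

$$\|\hat\theta - \theta^*\|_2 \le c_0 \sqrt{\frac{\nu + 1}{\nu^2}}\sqrt{\frac{d}{n}} + \delta. \tag{8.20}$$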


Figure 8.4 shows the results of simulations that confirm the qualitative scaling predicted by Corollary 8.7. In each case, we drew $n = 500$ samples from a spiked covariance matrix with the signal-to-noise parameter $\nu$ ranging over the interval $[0.75, 5]$. We then computed the $\ell_2$-distance $\|\hat\theta - \theta^*\|_2$ between the maximal eigenvectors of the sample and population covariances, respectively, performing $T = 100$ trials for each setting of $\nu$. The circles in Figure 8.4 show the empirical means, along with standard errors in crosses, whereas the solid curve corresponds to the theoretical prediction $\sqrt{\frac{\nu + 1}{\nu^2}}\sqrt{\frac{d}{n}}$. Note that Corollary 8.7 predicts this scaling, but with a looser leading constant ($c_0 > 1$). As shown by Figure 8.4, Corollary 8.7 accurately captures the scaling behavior of the error as a function of the signal-to-noise ratio.

Figure 8.4 Plots of the error $\|\hat\theta - \theta^*\|_2$ versus the signal-to-noise ratio, as measured by the eigengap $\nu$. Both plots are based on a sample size $n = 500$. Dots show the average of 100 trials, along with the standard errors (crosses). The full curve shows the theoretical bound $\sqrt{\frac{\nu + 1}{\nu^2}}\sqrt{\frac{d}{n}}$. (a) Dimension $d = 100$. (b) Dimension $d = 250$.
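A simulation in the spirit of Figure 8.4 can be sketched as follows. It is not the code used for the figure: it assumes Gaussian $\xi_i$ and $w_i$, takes $\theta^* = e_1$ (harmless for rotation-invariant Gaussian noise), uses only 10 trials per setting instead of 100 to keep the run short, and evaluates three values of $\nu$ for $d = 100$.

```python
import numpy as np


def pca_error(n, d, nu, rng):
    """One trial: l2 error of the top sample eigenvector under the model (8.18).

    Assumptions: Gaussian xi_i and w_i, and theta* = e_1 without loss of generality.
    """
    X = rng.standard_normal((n, d))
    X[:, 0] += np.sqrt(nu) * rng.standard_normal(n)   # add the spike along e_1
    theta_hat = np.linalg.eigh(X.T @ X / n)[1][:, -1]
    theta_star = np.zeros(d)
    theta_star[0] = 1.0
    # Account for the sign ambiguity of the eigenvector.
    return min(np.linalg.norm(theta_hat - theta_star),
               np.linalg.norm(theta_hat + theta_star))


def predicted_error(n, d, nu):
    """Theoretical scaling sqrt((nu + 1) / nu^2) * sqrt(d / n)."""
    return np.sqrt((nu + 1) / nu**2) * np.sqrt(d / n)


rng = np.random.default_rng(0)
n, d, trials = 500, 100, 10
results = {}
for nu in (0.75, 2.0, 5.0):
    errors = [pca_error(n, d, nu, rng) for _ in range(trials)]
    results[nu] = (float(np.mean(errors)), float(predicted_error(n, d, nu)))

for nu, (empirical, predicted) in results.items():
    print(f"nu = {nu:4.2f}: empirical = {empirical:.3f}, predicted = {predicted:.3f}")
```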


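Proof Let $P = \hat\Sigma - \Sigma$ be the difference between the sample and population covariance matrices. In order to apply Theorem 8.5, we need to upper bound the quantities $|||P|||_2$ and $\|\tilde p\|_2$. Defining the random vector $\bar w := \frac{1}{n}\sum_{i=1}^n \xi_i w_i$, the perturbation matrix $P$ can be decomposed as

$$P = \underbrace{\nu\Big(\frac{1}{n}\sum_{i=1}^n \xi_i^2 - 1\Big)\theta^*(\theta^*)^T}_{P_1} + \underbrace{\sqrt{\nu}\,\big(\bar w(\theta^*)^T + \theta^*\bar w^T\big)}_{P_2} + \underbrace{\Big(\frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d\Big)}_{P_3}. \tag{8.21}$$

Since $\|\theta^*\|_2 = 1$, the operator norm of $P$ can be bounded as

$$|||P|||_2 \le \nu\Big|\frac{1}{n}\sum_{i=1}^n \xi_i^2 - 1\Big| + 2\sqrt{\nu}\,\|\bar w\|_2 + \Big|\Big|\Big|\frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d\Big|\Big|\Big|_2. \tag{8.22a}$$

Let us derive a similar upper bound on $\|\tilde p\|_2$ using the decomposition (8.11). Since $\theta^*$ is the unique maximal eigenvector of $\Sigma$, it forms the first column of the matrix $U$. Let $U_2 \in \mathbb{R}^{d \times (d-1)}$ denote the matrix formed of the remaining $(d - 1)$ columns. With this notation, we have $\tilde p = U_2^T P\theta^*$. Using the decomposition (8.21) of the perturbation matrix and the fact that $U_2^T\theta^* = 0$, we find that $\tilde p = \sqrt{\nu}\, U_2^T\bar w + \frac{1}{n}\sum_{i=1}^n U_2^T w_i \langle w_i, \theta^* \rangle$. Since $U_2$ has orthonormal columns, we have $\|U_2^T\bar w\|_2 \le \|\bar w\|_2$ and also

$$\Big\|\frac{1}{n}\sum_{i=1}^n U_2^T w_i \langle w_i, \theta^* \rangle\Big\|_2 = \sup_{\|v\|_2 = 1} \Big| (U_2 v)^T\Big(\frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d\Big)\theta^* \Big| \le \Big|\Big|\Big|\frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d\Big|\Big|\Big|_2.$$

Putting together the pieces, we have shown that

$$\|\tilde p\|_2 \le \sqrt{\nu}\,\|\bar w\|_2 + \Big|\Big|\Big|\frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d\Big|\Big|\Big|_2. \tag{8.22b}$$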
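The three-term decomposition (8.21) and the resulting bounds (8.22a) and (8.22b) are deterministic identities and inequalities, so they can be verified on any single draw. The following sketch does this for one Gaussian sample; the problem sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, nu = 200, 10, 2.0

theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

xi = rng.standard_normal(n)                       # assumed Gaussian xi_i
W = rng.standard_normal((n, d))                   # rows are the noise vectors w_i
X = np.sqrt(nu) * xi[:, None] * theta_star[None, :] + W

Sigma = nu * np.outer(theta_star, theta_star) + np.eye(d)
P = X.T @ X / n - Sigma                           # perturbation matrix

# The three pieces of the decomposition (8.21).
w_bar = (xi[:, None] * W).mean(axis=0)
P1 = nu * (np.mean(xi**2) - 1.0) * np.outer(theta_star, theta_star)
P2 = np.sqrt(nu) * (np.outer(w_bar, theta_star) + np.outer(theta_star, w_bar))
P3 = W.T @ W / n - np.eye(d)


def op_norm(M):
    return np.linalg.norm(M, 2)


# Bound (8.22a) on the operator norm of P.
lhs_a = op_norm(P)
rhs_a = nu * abs(np.mean(xi**2) - 1.0) + 2 * np.sqrt(nu) * np.linalg.norm(w_bar) + op_norm(P3)

# Bound (8.22b) on p_tilde = U_2^T P theta*, with U_2 spanning the complement of theta*.
eigvecs = np.linalg.eigh(Sigma)[1]                # last column is +/- theta*
U2 = eigvecs[:, :-1]
p_tilde = U2.T @ P @ eigvecs[:, -1]
lhs_b = np.linalg.norm(p_tilde)
rhs_b = np.sqrt(nu) * np.linalg.norm(w_bar) + op_norm(P3)

print(f"(8.22a): {lhs_a:.4f} <= {rhs_a:.4f}")
print(f"(8.22b): {lhs_b:.4f} <= {rhs_b:.4f}")
```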


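The following lemma allows us to control the quantities appearing in the bounds (8.22a) and (8.22b):


Lemma 8.8 Under the conditions of Corollary 8.7, we have

$$\mathbb{P}\Big[\Big|\frac{1}{n}\sum_{i=1}^n \xi_i^2 - 1\Big| \ge \delta_1\Big] \le 2e^{-c_2 n \min\{\delta_1, \delta_1^2\}}, \tag{8.23a}$$

$$\mathbb{P}\Big[\|\bar w\|_2 \ge 2\sqrt{\frac{d}{n}} + \delta_2\Big] \le 2e^{-c_2 n \min\{\delta_2, \delta_2^2\}} \tag{8.23b}$$

and

$$\mathbb{P}\Big[\Big|\Big|\Big|\frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d\Big|\Big|\Big|_2 \ge c_3\sqrt{\frac{d}{n}} + \delta_3\Big] \le 2e^{-c_2 n \min\{\delta_3, \delta_3^2\}}. \tag{8.23c}$$


We leave the proof of this claim as an exercise, since it is a straightforward application of results and techniques from previous chapters. For future reference, we define

$$\phi(\delta_1, \delta_2, \delta_3) := 2e^{-c_2 n \min\{\delta_1, \delta_1^2\}} + 2e^{-c_2 n \min\{\delta_2, \delta_2^2\}} + 2e^{-c_2 n \min\{\delta_3, \delta_3^2\}}, \tag{8.24}$$

corresponding to the probability with which at least one of the bounds in Lemma 8.8 is violated.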

In order to apply Theorem 8.5, we need to first show that $|||P|||_2 < \frac{\nu}{4}$ with high probability. Beginning with the inequality (8.22a) and applying Lemma 8.8 with $\delta_1 = \frac{1}{16}$, $\delta_2 = \frac{\delta}{4\sqrt{\nu}}$ and $\delta_3 = \delta/16 \in (0, 1)$, we have

$$|||P|||_2 \le \frac{\nu}{16} + 8(\sqrt{\nu} + 1)\sqrt{\frac{d}{n}} + \delta \le \frac{\nu}{16} + 16\sqrt{\nu + 1}\,\sqrt{\frac{d}{n}} + \delta$$

with probability at least $1 - \phi(\delta_1, \delta_2, \delta_3)$. Consequently, as long as $\sqrt{\frac{\nu + 1}{\nu^2}}\sqrt{\frac{d}{n}} \le \frac{1}{128}$, we have

$$|||P|||_2 \le \frac{3}{16}\nu + \delta < \frac{\nu}{4} \qquad \text{for all } \delta \in \big(0, \tfrac{\nu}{16}\big).$$

It remains to bound $\|\tilde p\|_2$. Applying Lemma 8.8 to the inequality (8.22b) with the previously specified choices of $(\delta_1, \delta_2, \delta_3)$, we have

$$\|\tilde p\|_2 \le (2\sqrt{\nu} + c_3)\sqrt{\frac{d}{n}} + \delta$$

with probability at least $1 - \phi(\delta_1, \delta_2, \delta_3)$. We have shown that the conditions of Theorem 8.5 are satisfied, so that the claim (8.20) follows as a consequence of the bound (8.12).


8.3 Sparse principal component analysis 


Note that Corollary 8.7 requires that the sample size n be larger than the dimension d in 
order for ordinary PCA to perform well. One might wonder whether this requirement is 
fundamental: does PCA still perform well in the high-dimensional regime n < d? 

The answer to this question turns out to be a dramatic “no”. As discussed at more length 
in the bibliography section, for any fixed signal-to-noise ratio, if the ratio d/n stays suit- 
ably bounded away from zero, then the eigenvectors of the sample covariance in a spiked 
covariance model become asymptotically orthogonal to their population analogs. Thus, the 
classical PCA estimate is no better than ignoring the data, and drawing a vector uniformly at 
random from the Euclidean sphere. Given this total failure of classical PCA, a next question 
to ask is whether the eigenvectors might be estimated consistently using a method more so- 
phisticated than PCA. This question also has a negative answer: as we discuss in Chapter 15, 
for the standard spiked model (8.18), it can be shown via the framework of minimax the- 
ory that no method can produce consistent estimators of the population eigenvectors when 
d/n stays bounded away from zero. See Example 15.19 in Chapter 15 for the details of this 
minimax lower bound. 
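The contrast between the two regimes can be seen in a small experiment. The sketch below measures the alignment $|\langle \hat\theta, \theta^* \rangle|$ of the top sample eigenvector in a low-dimensional setting ($d \ll n$) and in a high-dimensional one ($d \gg n$) at the same signal-to-noise ratio. The specific sizes, the Gaussian noise and the choice $\theta^* = e_1$ are assumptions made for illustration; the leading eigenvector is obtained from an SVD of the data matrix so that the $d \times d$ covariance need not be formed.

```python
import numpy as np


def top_right_singular_vector(X):
    """Leading eigenvector of X^T X / n, computed via a thin SVD of X."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]


def alignment(n, d, nu, rng):
    """|<theta_hat, theta*>| for one draw from the spiked model with theta* = e_1."""
    X = rng.standard_normal((n, d))
    X[:, 0] += np.sqrt(nu) * rng.standard_normal(n)
    return abs(top_right_singular_vector(X)[0])


rng = np.random.default_rng(0)
nu = 1.0
low_dim = alignment(n=2000, d=20, nu=nu, rng=rng)      # d/n = 0.01
high_dim = alignment(n=100, d=2000, nu=nu, rng=rng)    # d/n = 20

print(f"d/n = 0.01: |<theta_hat, theta*>| = {low_dim:.3f}")
print(f"d/n = 20  : |<theta_hat, theta*>| = {high_dim:.3f}")
```

In the first setting the sample eigenvector is nearly parallel to $\theta^*$, whereas in the second it is close to orthogonal, consistent with the discussion above.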

In practice, however, it is often reasonable to impose structure on eigenvectors, and this 
structure can be exploited to develop effective estimators even when n < d. Perhaps the 
simplest such structure is that of sparsity in the eigenvectors, which allows for both effective 
estimation in high-dimensional settings, as well as increased interpretability. Accordingly, 
this section is devoted to the sparse version of principal component analysis. 


Let us illustrate the idea of sparse eigenanalysis by revisiting the eigenfaces from Exam- 
ple 8.2. 


Example 8.9 (Sparse eigenfaces) We used the images from the Yale Face Database to set 
up a PCA problem in d = 77760 dimensions. In this example, we used an iterative method 
to approximate sparse eigenvectors with at most $s = \lfloor 0.25d \rfloor = 19\,440$ non-zero coefficients.
In particular, we applied a thresholded version of the matrix power method for computing 
sparse eigenvalues and eigenvectors. (See Exercise 8.5 for exploration of the standard matrix 
power method.) 

Figure 8.5(a) shows the average face (top left image), along with approximations to the 
first 24 sparse eigenfaces. Each sparse eigenface was restricted to have at most 25% of its 
pixels non-zero, corresponding to a savings of a factor of 4 in storage. Note that the sparse 
eigenfaces are more localized than their PCA analogs from Figure 8.2. Figure 8.5(b) shows 
reconstruction using the average face in conjunction with the first 100 sparse eigenfaces, 
which require equivalent storage (in terms of pixel values) to the first 25 regular eigenfaces. ♣
Figure 8.5 Illustration of sparse eigenanalysis for the Yale Face Database. (a) Average
face (top left image), and approximations to the first 24 sparse eigenfaces, ob-
tained by a greedy iterative thresholding procedure applied to the eigenvalue power 
method. Eigenfaces were restricted to have at most 25% of their pixels non-zero, cor- 
responding to a 1/4 reduction in storage. (b) Reconstruction based on sparse eigen- 
faces. 
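Example 8.9 does not spell out the exact thresholding rule, so the following is only a minimal sketch of one thresholded power iteration: after each multiplication by the matrix, all but the $s$ largest-magnitude coordinates are set to zero and the vector is renormalised. The hard-thresholding rule, the initialisation from the ordinary top eigenvector, the iteration count and the synthetic sparse spiked data are all assumptions made for illustration, not the procedure used to produce Figure 8.5.

```python
import numpy as np


def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out


def truncated_power_method(S, s, num_iters=100):
    """Power iteration on S with hard thresholding to s non-zeros at every step.

    Assumed initialisation: the thresholded top eigenvector of S.
    """
    v = hard_threshold(np.linalg.eigh(S)[1][:, -1], s)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = hard_threshold(S @ v, s)
        v /= np.linalg.norm(v)
    return v


# Synthetic test case: s-sparse spiked covariance model with Gaussian noise.
rng = np.random.default_rng(0)
n, d, s, nu = 400, 400, 5, 4.0
theta_star = np.zeros(d)
theta_star[:s] = 1.0 / np.sqrt(s)

xi = rng.standard_normal(n)
X = np.sqrt(nu) * xi[:, None] * theta_star[None, :] + rng.standard_normal((n, d))
S = X.T @ X / n

theta_sparse = truncated_power_method(S, s)
theta_pca = np.linalg.eigh(S)[1][:, -1]

align_sparse = abs(theta_sparse @ theta_star)
align_pca = abs(theta_pca @ theta_star)
print(f"alignment of ordinary PCA       : {align_pca:.3f}")
print(f"alignment of thresholded method : {align_sparse:.3f}")
```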


8.3.1 A general deterministic result 


We now turn to the question of how to estimate a maximal eigenvector that is known a priori 
to be sparse. A natural approach is to augment the quadratic objective function underlying 
classical PCA with an additional sparsity constraint or penalty. More concretely, we analyze 
both the constrained problem 


$$\hat\theta \in \arg\max_{\|\theta\|_2 = 1} \big\{ \langle \theta, \hat\Sigma\theta \rangle \big\} \quad \text{such that } \|\theta\|_1 \le R, \tag{8.25a}$$

as well as the penalized variant

$$\hat\theta \in \arg\max_{\|\theta\|_2 = 1} \big\{ \langle \theta, \hat\Sigma\theta \rangle - \lambda_n\|\theta\|_1 \big\} \quad \text{such that } \|\theta\|_1 \le \Big(\frac{n}{\log d}\Big)^{1/4}. \tag{8.25b}$$

In our analysis of the constrained version (8.25a), we set $R = \|\theta^*\|_1$. The advantage of the penalized variant (8.25b) is that the regularization parameter $\lambda_n$ can be chosen without knowledge of the true eigenvector $\theta^*$. In both formulations, the matrix $\hat\Sigma$ represents some type of approximation to the population covariance matrix $\Sigma$, with the sample covariance being a canonical example. Note that neither optimization problem is convex, since both involve maximization of a positive semidefinite quadratic form. Nonetheless, it is instructive to analyze them in order to understand the statistical behavior of sparse PCA, and in the exercises, we describe some relaxations of these non-convex programs.

Naturally, the proximity of $\hat\theta$ to the maximum eigenvector $\theta^*$ of $\Sigma$ depends on the perturbation matrix $P := \hat\Sigma - \Sigma$. How should we measure the effect of the perturbation? As will become clear, much of our analysis of ordinary PCA can be modified in a relatively straightforward way so as to obtain results for the sparse version. In particular, a central object in our analysis of ordinary PCA was the basic inequality stated in Lemma 8.6: it shows that the perturbation matrix enters via the function

$$\Psi(\Delta; P) := \langle \Delta, P\Delta \rangle + 2\langle \Delta, P\theta^* \rangle.$$


As with our analysis of PCA, our general deterministic theorem for sparse PCA involves imposing a form of uniform control on $\Psi(\Delta; P)$ as $\Delta$ ranges over all vectors of the form $\theta - \theta^*$ with $\theta \in \mathbb{S}^{d-1}$. The sparsity constraint enters in the form of this uniform bound that we assume. More precisely, letting $\varphi_\nu(n, d)$ and $\psi_\nu(n, d)$ be non-negative functions of the eigengap $\nu$, sample size and dimension, we assume that there exists a universal constant $c_0 > 0$ such that

$$\sup_{\substack{\Delta = \theta - \theta^* \\ \|\theta\|_2 = 1}} |\Psi(\Delta; P)| \le c_0\,\nu\|\Delta\|_2^2 + \varphi_\nu(n, d)\|\Delta\|_1 + \psi_\nu^2(n, d)\|\Delta\|_1^2. \tag{8.26}$$

As a concrete example, for a sparse version of the spiked PCA ensemble (8.18) with sub-Gaussian tails, this condition is satisfied with high probability with $\varphi_\nu^2(n, d) \asymp (\nu + 1)\frac{\log d}{n}$ and $\psi_\nu^2(n, d) \asymp \frac{1}{\nu}\frac{\log d}{n}$. This fact will be established in the proof of Corollary 8.12 to follow.

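Theorem 8.10 Given a matrix $\Sigma$ with a unique, unit-norm, $s$-sparse maximal eigenvector $\theta^*$ with eigengap $\nu$, let $\hat\Sigma$ be any symmetric matrix satisfying the uniform deviation condition (8.26) with constant $c_0 < \frac{1}{6}$, and $16\, s\, \psi_\nu^2(n, d) \le c_0\nu$.

(a) For any optimal solution $\hat\theta$ to the constrained program (8.25a) with $R = \|\theta^*\|_1$,

$$\min\big\{ \|\hat\theta - \theta^*\|_2, \; \|\hat\theta + \theta^*\|_2 \big\} \le \frac{8}{\nu(1 - 4c_0)}\sqrt{s}\,\varphi_\nu(n, d). \tag{8.27}$$

(b) Consider the penalized program (8.25b) with the regularization parameter lower bounded as $\lambda_n \ge 4\big(\frac{n}{\log d}\big)^{1/4}\psi_\nu^2(n, d) + 2\varphi_\nu(n, d)$. Then any optimal solution $\hat\theta$ satisfies the bound

$$\min\big\{ \|\hat\theta - \theta^*\|_2, \; \|\hat\theta + \theta^*\|_2 \big\} \le \frac{2\big(\frac{\lambda_n}{\varphi_\nu(n, d)} + 4\big)}{\nu(1 - 4c_0)}\sqrt{s}\,\varphi_\nu(n, d). \tag{8.28}$$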


Proof We begin by analyzing the constrained estimator, and then describe the modifications necessary for the regularized version.


Argument for constrained estimator: Note that $\|\hat\theta\|_1 \le R = \|\theta^*\|_1$ by construction of the estimator, and moreover $\theta^*_{S^c} = 0$ by assumption. By splitting the $\ell_1$-norm into two components, indexed by $S$ and $S^c$, respectively, it can be shown (we leave this calculation as an exercise for the reader; helpful details can be found in Chapter 7) that the error $\hat\Delta = \hat\theta - \theta^*$ satisfies the inequality $\|\hat\Delta_{S^c}\|_1 \le \|\hat\Delta_S\|_1$. So as to simplify our treatment of the regularized estimator, let us proceed by assuming only the weaker inequality $\|\hat\Delta_{S^c}\|_1 \le 3\|\hat\Delta_S\|_1$, which implies that $\|\hat\Delta\|_1 \le 4\sqrt{s}\,\|\hat\Delta\|_2$. Combining this inequality with the uniform bound (8.26) on $\Psi$, we find that

$$|\Psi(\hat\Delta; P)| \le c_0\,\nu\|\hat\Delta\|_2^2 + 4\sqrt{s}\,\varphi_\nu(n, d)\|\hat\Delta\|_2 + 16\, s\, \psi_\nu^2(n, d)\|\hat\Delta\|_2^2. \tag{8.29}$$

Substituting back into the basic inequality (8.15) and performing some algebra yields

$$\nu\underbrace{\Big\{ \frac{1}{2} - c_0 - 16\,\frac{s}{\nu}\,\psi_\nu^2(n, d) \Big\}}_{\kappa}\|\hat\Delta\|_2^2 \le 4\sqrt{s}\,\varphi_\nu(n, d)\|\hat\Delta\|_2.$$

Note that our assumptions imply that $\kappa \ge \frac{1}{2}(1 - 4c_0) > 0$, so that the bound (8.27) follows after canceling a term $\|\hat\Delta\|_2$ and rearranging.
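For completeness, the step from the cone inequality to the comparison between the $\ell_1$- and $\ell_2$-norms of the error, used in the argument above, can be written out as follows. Here $S$ denotes the support of $\theta^*$, of cardinality at most $s$; the only facts used are the cone inequality itself and the Cauchy-Schwarz inequality on the $s$ coordinates in $S$.

```latex
\begin{align*}
\|\hat\Delta\|_1
  &= \|\hat\Delta_S\|_1 + \|\hat\Delta_{S^c}\|_1
   \;\le\; \|\hat\Delta_S\|_1 + 3\|\hat\Delta_S\|_1
   \;=\; 4\|\hat\Delta_S\|_1 \\
  &\le 4\sqrt{s}\,\|\hat\Delta_S\|_2
   \;\le\; 4\sqrt{s}\,\|\hat\Delta\|_2 .
\end{align*}
```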


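Argument for regularized estimator: We now turn to the regularized estimator (8.25b). With the addition of the regularizer, the basic inequality (8.15) now takes the slightly modified form

$$\frac{\nu}{2}\|\hat\Delta\|_2^2 - |\Psi(\hat\Delta; P)| \le \lambda_n\big\{ \|\theta^*\|_1 - \|\hat\theta\|_1 \big\} \le \lambda_n\big\{ \|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \big\}, \tag{8.30}$$

where the second inequality follows by the $S$-sparsity of $\theta^*$ and the triangle inequality (see Chapter 7 for details).

We claim that the error vector $\hat\Delta$ still satisfies a form of the cone inequality. Let us state this claim as a separate lemma.


Lemma 8.11 Under the conditions of Theorem 8.10, the error vector $\hat\Delta = \hat\theta - \theta^*$ satisfies the cone inequality

$$\|\hat\Delta_{S^c}\|_1 \le 3\|\hat\Delta_S\|_1 \quad \text{and hence} \quad \|\hat\Delta\|_1 \le 4\sqrt{s}\,\|\hat\Delta\|_2. \tag{8.31}$$


Taking this lemma as given, let us complete the proof of the theorem. Given Lemma 8.11, the previously derived upper bound (8.29) on $|\Psi(\hat\Delta; P)|$ is also applicable to the regularized estimator. Substituting this bound into our basic inequality, we find that

$$\nu\underbrace{\Big\{ \frac{1}{2} - c_0 - \frac{16}{\nu}\, s\, \psi_\nu^2(n, d) \Big\}}_{\kappa}\|\hat\Delta\|_2^2 \le \sqrt{s}\,\big(\lambda_n + 4\varphi_\nu(n, d)\big)\|\hat\Delta\|_2.$$

Our assumptions imply that $\kappa \ge \frac{1}{2}(1 - 4c_0) > 0$, from which claim (8.28) follows.

It remains to prove Lemma 8.11. Combining the uniform bound with the basic inequality (8.30) yields

$$0 \le \nu\underbrace{\Big(\frac{1}{2} - c_0\Big)}_{> 0}\|\hat\Delta\|_2^2 \le \varphi_\nu(n, d)\|\hat\Delta\|_1 + \psi_\nu^2(n, d)\|\hat\Delta\|_1^2 + \lambda_n\big\{ \|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \big\}.$$

Introducing the shorthand $R = \big(\frac{n}{\log d}\big)^{1/4}$, the feasibility of $\hat\theta$ and $\theta^*$ implies that $\|\hat\Delta\|_1 \le 2R$, and hence

$$\begin{aligned}
0 &\le \underbrace{\big\{ \varphi_\nu(n, d) + 2R\,\psi_\nu^2(n, d) \big\}}_{\le \lambda_n/2}\|\hat\Delta\|_1 + \lambda_n\big\{ \|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \big\} \\
&\le \lambda_n\Big\{ \frac{3}{2}\|\hat\Delta_S\|_1 - \frac{1}{2}\|\hat\Delta_{S^c}\|_1 \Big\},
\end{aligned}$$

and rearranging yields the claim.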


8.3.2 Consequences for the spiked model with sparsity 


Theorem 8.10 is a general deterministic guarantee that applies to any matrix with a sparse maximal eigenvector. In order to obtain more concrete results in a particular case, let us return to the spiked covariance model previously introduced in equation (8.18), and analyze a sparse variant of it. More precisely, consider a random vector $x_i \in \mathbb{R}^d$ generated from the usual spiked ensemble, namely as $x_i \overset{d}{=} \sqrt{\nu}\,\xi_i\theta^* + w_i$, where $\theta^* \in \mathbb{S}^{d-1}$ is an $s$-sparse vector, corresponding to the maximal eigenvector of $\Sigma = \mathrm{cov}(x_i)$. As before, we assume that both the random variable $\xi_i$ and the random vector $w_i \in \mathbb{R}^d$ are independent, each sub-Gaussian with parameter 1, in which case we say that the random vector $x_i \in \mathbb{R}^d$ has sub-Gaussian tails.


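Corollary 8.12 Consider $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ from an $s$-sparse spiked covariance matrix with eigengap $\nu > 0$ and suppose that $\frac{s \log d}{n} \le c \min\big\{1, \frac{\nu^2}{\nu + 1}\big\}$ for a sufficiently small constant $c > 0$. Then for any $\delta \in (0, 1)$, any optimal solution $\hat\theta$ to the constrained program (8.25a) with $R = \|\theta^*\|_1$, or to the penalized program (8.25b) with $\lambda_n = c_3\sqrt{\nu + 1}\,\big\{ \sqrt{\frac{\log d}{n}} + \delta \big\}$, satisfies the bound

$$\min\big\{ \|\hat\theta - \theta^*\|_2, \; \|\hat\theta + \theta^*\|_2 \big\} \le c_4\sqrt{\frac{\nu + 1}{\nu^2}}\,\Big\{ \sqrt{\frac{s \log d}{n}} + \delta \Big\} \qquad \text{for all } \delta \in (0, 1) \tag{8.32}$$

with probability at least $1 - c_1 e^{-c_2 (n/s) \min\{\delta^2, \nu^2, \nu\}}$.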


Proof Letting $P = \hat\Sigma - \Sigma$ be the deviation between the sample and population covariance matrices, our goal is to show that $\Psi(\cdot\,; P)$ satisfies the uniform deviation condition (8.26). In particular, we claim that, uniformly over $\Delta \in \mathbb{R}^d$, we have

$$|\Psi(\Delta; P)| \le \underbrace{\frac{1}{8}}_{c_0}\nu\|\Delta\|_2^2 + \underbrace{16\sqrt{\nu + 1}\,\Big\{ \sqrt{\frac{\log d}{n}} + \delta \Big\}}_{\varphi_\nu(n, d)}\|\Delta\|_1 + \underbrace{\frac{c_3'}{\nu}\frac{\log d}{n}}_{\psi_\nu^2(n, d)}\|\Delta\|_1^2, \tag{8.33}$$

with probability at least $1 - c_1 e^{-c_2 n \min\{\delta^2, \nu^2, \nu\}}$. Here $(c_1, c_2, c_3')$ are universal constants. Taking this intermediate claim as given, let us verify that the bound (8.32) follows as a consequence of Theorem 8.10. We have
Osw2(n, d 12c) 
synd) _ 12c; slogd | <v frz s) cy, 
n n 


Co y 


using the assumed upper bound on the ratio $\frac{s \log d}{n}$ for a sufficiently small constant $c$. Consequently, the bound for the constrained estimator follows from Theorem 8.10. For the penalized estimator, there are a few other conditions to be verified: let us first check that


lelli < v . Since 6* is s-sparse with ||6*|l = 1, it suffices to have ys < v lissa? or 
ena 1 shod < 1, which follows from our assumptions. Finally, we need to check 


that 2, satisfies the lower bound requirement in Theorem 8.10. We have 


l i 
4R y(n, d) + 2,(n,d) <4v,J— Steed ayra [PE «| 
y n n 


logd 


sa wrid [8 «ol 
n 
Á S$ ~~. 


Àn 


as required. 

It remains to prove the uniform bound (8.33). Recall the decomposition $P = \sum_{j=1}^3 P_j$ given in equation (8.21). By linearity of the function $\Psi$ in its second argument, this decomposition implies that $\Psi(\Delta; P) = \sum_{j=1}^3 \Psi(\Delta; P_j)$. We control each of these terms in turn.


Control of first component: Lemma 8.8 guarantees that | Xag- 1| < + with probability 
at least 1 — 2e7™”. Conditioned on this bound, for any vector of the F A = 6-6 with 
0 e St!) we have 


v R v 
(P(A; P| < — (A, PY = — 1- (0, 0Y < alll. (8.34) 
16 16 
where we have used the fact that 2(1 — (@", 0) Y < 2(1 — (6, ee = ||All5. 
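Control of second component: We have

$$\begin{aligned}
|\Psi(\Delta; P_2)| &\le 2\sqrt{\nu}\,\big\{ |\langle \Delta, \bar w \rangle|\,|\langle \Delta, \theta^* \rangle| + |\langle \bar w, \Delta \rangle| + |\langle \theta^*, \bar w \rangle|\,|\langle \Delta, \theta^* \rangle| \big\} \\
&\le 4\sqrt{\nu}\,\|\Delta\|_1\|\bar w\|_\infty + 2\sqrt{\nu}\,|\langle \theta^*, \bar w \rangle|\,\frac{\|\Delta\|_2^2}{2}.
\end{aligned} \tag{8.35}$$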


The following lemma provides control on the two terms in this upper bound: 
Lemma 8.13 Under the conditions of Corollary 8.12, we have


logd 
P iwi EA 5 < ce" ~~ for all 6 € (0,1), and (8.36a) 
n 


P|", w)| = x stent Aah (8.36b) 


A 


We leave the proof of these bounds as an exercise for the reader, since they follow from 
standard results in Chapter 2. Combining Lemma 8.13 with the bound (8.35) yields 


og d 
'P(A; P2)| < IAk +8 Vy + 1 2 + | Alli. (8.37) 


Control of third term: Recalling that $P_3 = \frac{1}{n}\sum_{i=1}^n w_i w_i^T - I_d$, we have

$$|\Psi(\Delta; P_3)| \le |\langle \Delta, P_3\Delta \rangle| + 2\|P_3\theta^*\|_\infty\|\Delta\|_1. \tag{8.38}$$


Our final lemma controls the two terms in this bound: 


Lemma 8.14 Under the conditions of Corollary 8.12, for all 6 € (0, 1), we have 


l 
eaea e = Eö (8.39a) 


c3 -—— 


and 


sup KA, P3A)| < alla + Sa (8.39b) 
AER 


where both inequalities hold with probability greater than 1 — cen 


Combining this lemma with our earlier inequality (8.38) yields the bound 


logd d 
'¥(A; P3)| < A +8 i Vr ei | |All g3 x I|All7. (8.40) 


Finally, combining the bounds (8.34), (8.37) and (8.40) yields the claim (8.33). 


The only remaining detail is the proof of Lemma 8.14. The proof of the tail bound (8.39a) is a simple exercise, using the sub-exponential tail bounds from Chapter 2. The proof of the bound (8.39b) requires a more involved argument, one that makes use of both Exercise 7.10 and our previous results on estimation of sample covariances from Chapter 6.

For a constant € > 0 to be chosen, consider the positive integer k := [€v” 


“_', and the 


logd 
collection of submatrices $\{(P_3)_{SS}, \; |S| = k\}$. Given a parameter $\alpha \in (0, 1)$ to be chosen, a combination of the union bound and Theorem 6.5 implies that there are universal constants $c_1$ and $c_2$ such that


k 2,2 
P [max Ps)sslle > cı Je + ay| < 2er oC, 
= n 


Since log (i) < 2klog(d) < 4év’n, this probability is at most e -48 = e722, as 
long as we set € = a?/8. The result of Exercise 7.10 then implies that 


8 aes 
n 


KA, P3A)| < cadia + iang} for all A € R4, 


with the previously stated probability. Setting œ = 


c} = (20°). 


TN yields the claim (8.39b) with 
8.4 Bibliographic details and background


Further details on PCA and its applications can be found in books by Anderson (1984) (cf. chapter 11), Jolliffe (2004) and Muirhead (2008). See the two-volume set by Horn and Johnson (1985; 1991) for background on matrix analysis, as well as the book by Bhatia (1997) for
a general operator-theoretic viewpoint. The book by Stewart and Sun (1980) is more specif- 
ically focused on matrix perturbation theory, whereas Stewart (1971) provides perturbation 
theory in the more general setting of closed linear operators. 

Johnstone (2001) introduced the spiked covariance model (8.18), and investigated the 
high-dimensional asymptotics of its eigenstructure; see also the papers by Baik and Silver- 
stein (2006) and Paul (2007) for high-dimensional asymptotics. Johnstone and Lu (2009) 
introduced the sparse variant of the spiked ensemble, and proved consistency results for a 
simple estimator based on thresholding the diagonal entries of the sample covariance ma- 
trix. Amini and Wainwright (2009) provided a more refined analysis of this same estimator, 
as well as of a semidefinite programming (SDP) relaxation proposed by d'Aspremont et al. (2007). See Exercise 8.8 for the derivation of this latter SDP relaxation. The non-convex estimator (8.25a) was first proposed by Jolliffe et al. (2003), and called the SCOTLASS criterion; Witten et al. (2009) derive an alternating algorithm for finding a local optimum of
this criterion. Other authors, including Ma (2010; 2013) and Yuan and Zhang (2013), have 
studied iterative algorithms for sparse PCA based on combining the power method with soft 
or hard thresholding. 

Minimax lower bounds for estimating principal components in various types of spiked en- 
sembles can be derived using techniques discussed in Chapter 15. These lower bounds show 
that the upper bounds obtained in Corollaries 8.7 and 8.12 for ordinary and sparse PCA, 
respectively, are essentially optimal. See Birnbaum et al. (2012) and Vu and Lei (2012) for lower bounds on the $\ell_2$-norm error in sparse PCA. Amini and Wainwright (2009) derived lower bounds for the problem of variable selection in sparse PCA. Some of these lower bounds are covered in this book: in particular, see Example 15.19 for minimax lower bounds on $\ell_2$-error in ordinary PCA, Example 15.20 for lower bounds on variable selection in sparse PCA, and Exercise 15.16 for $\ell_2$-error lower bounds on sparse PCA. Berthet and Rigollet (2013) derived certain hardness results for the problem of sparse PCA detection, based on relating it to the (conjectured) average-case hardness of the planted k-clique problem in Erdős-Rényi random graphs. Ma and Wu (2013) developed a related but distinct reduction, one which applies to a Gaussian detection problem over a family of sparse-plus-low-rank matrices. See also the papers (Wang et al., 2014; Cai et al., 2015; Gao et al., 2015) for related results using the conjectured hardness of the k-clique problem.


8.5 Exercises 


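Exercise 8.1 (Courant-Fischer variational representation) For a given integer $j \in \{2, \ldots, d\}$,
let $\mathcal{V}_{j-1}$ denote the collection of all subspaces of dimension $j - 1$. For any symmetric matrix
$Q$, show that the $j$th largest eigenvalue is given by

$$\gamma_j(Q) = \min_{\mathbb{V} \in \mathcal{V}_{j-1}} \; \max_{u \in \mathbb{V}^\perp \cap \mathbb{S}^{d-1}} \langle u, Qu \rangle, \tag{8.41}$$

where $\mathbb{V}^\perp$ denotes the orthogonal subspace to $\mathbb{V}$.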


Exercise 8.2 (Unitarily invariant matrix norms) For positive integers $d_1 \le d_2$, a matrix
norm on $\mathbb{R}^{d_1 \times d_2}$ is unitarily invariant if $|\!|\!|M|\!|\!| = |\!|\!|VMU|\!|\!|$ for all orthonormal matrices
$V \in \mathbb{R}^{d_1 \times d_1}$ and $U \in \mathbb{R}^{d_2 \times d_2}$.

(a) Which of the following matrix norms are unitarily invariant?

  (i) The Frobenius norm $|\!|\!|M|\!|\!|_F$.
  (ii) The nuclear norm $|\!|\!|M|\!|\!|_{\mathrm{nuc}}$.
  (iii) The $\ell_2$-operator norm $|\!|\!|M|\!|\!|_2 = \sup_{\|u\|_2 = 1} \|Mu\|_2$.
  (iv) The $\ell_\infty$-operator norm $|\!|\!|M|\!|\!|_\infty = \sup_{\|u\|_\infty = 1} \|Mu\|_\infty$.

(b) Let $\rho$ be a norm on $\mathbb{R}^{d_1}$ that is invariant to permutations and sign changes, that is,

$$\rho(x_1, \ldots, x_{d_1}) = \rho\big(z_1 x_{\pi(1)}, \ldots, z_{d_1} x_{\pi(d_1)}\big)$$

for all binary strings $z \in \{-1, 1\}^{d_1}$ and permutations $\pi$ on $\{1, \ldots, d_1\}$. Such a function is
known as a symmetric gauge function. Letting $\{\sigma_j(M)\}_{j=1}^{d_1}$ denote the singular values of
$M$, show that

$$|\!|\!|M|\!|\!|_\rho := \rho\big(\underbrace{\sigma_1(M), \ldots, \sigma_{d_1}(M)}_{\sigma(M) \in \mathbb{R}^{d_1}}\big)$$

defines a matrix norm. (Hint: For any pair of $d_1 \times d_2$ matrices $M$ and $N$, we have
$\mathrm{trace}(N^T M) \le \langle \sigma(N), \sigma(M) \rangle$, where $\sigma(M)$ denotes the ordered vector of singular values.)

(c) Show that all matrix norms in the family from part (b) are unitarily invariant.


Exercise 8.3 (Weyl’s inequality) Prove Weyl’s inequality (8.9). (Hint: Exercise 8.1 may 
be useful.) 


Exercise 8.4 (Variational characterization of eigenvectors) Show that the orthogonal
matrix $V \in \mathbb{R}^{d \times r}$ maximizing the criterion (8.2) has columns formed by the top $r$ eigenvectors
of $\Sigma = \mathrm{cov}(X)$.
Exercise 8.5 (Matrix power method) Let $Q \in \mathcal{S}^{d \times d}$ be a strictly positive definite symmetric
matrix with a unique maximal eigenvector $\theta^*$. Given some non-zero initial vector $\theta^0 \in \mathbb{R}^d$,
consider the sequence $\{\theta^t\}_{t=0}^\infty$ defined by

$$\theta^{t+1} = \frac{Q\theta^t}{\|Q\theta^t\|_2}. \tag{8.42}$$

(a) Prove that there is a large set of initial vectors $\theta^0$ for which the sequence $\{\theta^t\}_{t=0}^\infty$ converges
to $\theta^*$.
(b) Give a "bad" initialization for which this convergence does not take place.
(c) Based on part (b), specify a procedure to compute the second largest eigenvector,
assuming it is also unique.
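The recursion (8.42) is short enough to try numerically before proving anything about it. The following is a minimal sketch, not part of the exercise: the matrix `Q`, the starting vector, and the iteration count are illustrative assumptions, and the function name `power_method` is hypothetical.

```python
import numpy as np

def power_method(Q, theta0, num_iters=200):
    """Run the recursion theta <- Q theta / ||Q theta||_2 from (8.42)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        v = Q @ theta
        theta = v / np.linalg.norm(v)
    return theta

# Illustrative symmetric positive definite matrix (an assumption, not from the text);
# its eigenvalues are 3 + sqrt(3), 3 and 3 - sqrt(3).
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
theta_hat = power_method(Q, np.ones(3))

# Compare against a reference eigen-decomposition (eigh sorts eigenvalues ascending).
top_eigenvector = np.linalg.eigh(Q)[1][:, -1]
alignment = abs(theta_hat @ top_eigenvector)   # |cosine| between the two unit vectors
rayleigh = theta_hat @ Q @ theta_hat           # estimate of the largest eigenvalue
```

The absolute value in `alignment` reflects the fact that an eigenvector is only defined up to sign.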


Exercise 8.6 (PCA for Gaussian mixture models) Consider an instance of the Gaussian
mixture model from Example 8.3 with equal mixture weights ($\alpha = 0.5$) and unit-norm mean
vector ($\|\theta^*\|_2 = 1$), and suppose that we implement the PCA-based estimator $\widehat{\theta}$ for the mean
vector $\theta^*$.

(a) Prove that if the sample size is lower bounded as $n > c_1 \sigma^2 (1 + \sigma^2) d$ for a sufficiently
large constant $c_1$, this estimator satisfies a bound of the form

$$\|\widehat{\theta} - \theta^*\|_2 \le c_2 \, \sigma \sqrt{1 + \sigma^2} \, \sqrt{\frac{d}{n}}$$

with high probability.

(b) Explain how to use your estimator to build a classification rule, that is, a mapping
$x \mapsto \psi(x) \in \{-1, +1\}$, where the binary labels code whether sample $x$ has mean $-\theta^*$ or
$+\theta^*$.

(c) Does your method still work if the shared covariance matrix is not a multiple of the
identity?


Exercise 8.7 (PCA for retrieval from absolute values) Suppose that our goal is to estimate
an unknown vector $\theta^* \in \mathbb{R}^d$ based on $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ of the form $y_i = |\langle x_i, \theta^* \rangle|$,
where $x_i \sim N(0, I_d)$. This model is a real-valued idealization of the problem of phase
retrieval, to be discussed at more length in Chapter 10. Suggest a PCA-based method for
estimating $\theta^*$ that is consistent in the limit of infinite data. (Hint: Using the pair $(x, y)$, try to
construct a random matrix $Z$ such that $\mathbb{E}[Z] = \sqrt{2/\pi}\,\big(\theta^* \otimes \theta^* + I_d\big)$.)


Exercise 8.8 (Semidefinite relaxation of sparse PCA) Recall the non-convex problem
(8.25a), also known as the SCOTLASS estimator. In this exercise, we derive a convex
relaxation of the objective, due to d'Aspremont et al. (2007).

(a) Show that the non-convex problem (8.25a) is equivalent to the optimization problem

$$\max_{\Theta \in \mathcal{S}_+^{d \times d}} \mathrm{trace}(\widehat{\Sigma}\Theta) \quad \text{such that } \mathrm{trace}(\Theta) = 1, \; \sum_{j,k=1}^d |\Theta_{jk}| \le R^2 \text{ and } \mathrm{rank}(\Theta) = 1,$$

where $\mathcal{S}_+^{d \times d}$ denotes the cone of symmetric, positive semidefinite matrices.

(b) Dropping the rank constraint yields the convex program

$$\max_{\Theta \in \mathcal{S}_+^{d \times d}} \mathrm{trace}(\widehat{\Sigma}\Theta) \quad \text{such that } \mathrm{trace}(\Theta) = 1 \text{ and } \sum_{j,k=1}^d |\Theta_{jk}| \le R^2.$$

What happens when its optimum is achieved at a rank-one matrix?


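Exercise 8.9 (Primal-dual witness for sparse PCA) The SDP relaxation from Exercise
8.8(b) can be written in the equivalent Lagrangian form

$$\max_{\substack{\Theta \in \mathcal{S}_+^{d \times d} \\ \mathrm{trace}(\Theta) = 1}} \Big\{ \mathrm{trace}(\widehat{\Sigma}\Theta) - \lambda_n \sum_{j,k=1}^d |\Theta_{jk}| \Big\}. \tag{8.43}$$

Suppose that there exists a vector $\widehat{\theta} \in \mathbb{R}^d$ and a matrix $\widehat{U} \in \mathbb{R}^{d \times d}$ such that

$$\widehat{U}_{jk} \begin{cases} = \mathrm{sign}(\widehat{\theta}_j \widehat{\theta}_k) & \text{if } \widehat{\theta}_j \widehat{\theta}_k \neq 0, \\ \in [-1, 1] & \text{otherwise,} \end{cases}$$

and moreover such that $\widehat{\theta}$ is a maximal eigenvector of the matrix $\widehat{\Sigma} - \lambda_n \widehat{U}$. Prove that the
rank-one matrix $\widehat{\Theta} = \widehat{\theta} \otimes \widehat{\theta}$ is an optimal solution to the SDP relaxation (8.43).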


9 


Decomposability and restricted strong convexity 


In Chapter 7, we studied the class of sparse linear models, and the associated use of
$\ell_1$-regularization. The basis pursuit and Lasso programs are special cases of a more general
family of estimators, based on combining a cost function with a regularizer. Minimizing 
such an objective function yields an estimation method known as an M-estimator. The goal 
of this chapter is to study this more general family of regularized M-estimators, and to 
develop techniques for bounding the associated estimation error for high-dimensional prob- 
lems. Two properties are essential to obtaining consistent estimators in high dimensions: 
decomposability of the regularizer, and a certain type of lower restricted curvature condition 
on the cost function. 


9.1 A general regularized M-estimator


Our starting point is an indexed family of probability distributions $\{\mathbb{P}_\theta, \theta \in \Omega\}$, where $\theta$
represents some type of "parameter" to be estimated. As we discuss in the sequel, the space
$\Omega$ of possible parameters can take various forms, including subsets of vectors, matrices, or
(in the nonparametric setting to be discussed in Chapters 13 and 14) subsets of regression
or density functions. Suppose that we observe a collection of $n$ samples $Z_1^n = (Z_1, \ldots, Z_n)$,
where each sample $Z_i$ takes values in some space $\mathcal{Z}$, and is drawn independently according to
some distribution $\mathbb{P}$. In the simplest setting, known as the well-specified case, the distribution
$\mathbb{P}$ is a member of our parameterized family, say $\mathbb{P} = \mathbb{P}_{\theta^*}$, and our goal is to estimate
the unknown parameter $\theta^*$. However, our set-up will also allow for mis-specified models,
in which case the target parameter $\theta^*$ is defined as the minimizer of the population cost
function; in particular, see equation (9.2) below.

The first ingredient of a general M-estimator is a cost function $\mathcal{L}_n : \Omega \times \mathcal{Z}^n \to \mathbb{R}$, where
the value $\mathcal{L}_n(\theta; Z_1^n)$ provides a measure of the fit of parameter $\theta$ to the data $Z_1^n$. Its expectation
defines the population cost function, namely the quantity

$$\mathcal{L}(\theta) := \mathbb{E}\big[\mathcal{L}_n(\theta; Z_1^n)\big]. \tag{9.1}$$

Implicit in this definition is that the expectation does not depend on the sample size $n$,
a condition which holds in many settings (with appropriate scalings). For instance, it is
often the case that the cost function has an additive decomposition of the form
$\mathcal{L}_n(\theta; Z_1^n) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\theta; Z_i)$, where $\mathcal{L} : \Omega \times \mathcal{Z} \to \mathbb{R}$ is the cost defined for a single sample. Of course, any
likelihood-based cost function decomposes in this way when the samples are drawn in an
independent and identically distributed manner, but such cost functions can also be useful
for dependent data.
Next we define the target parameter as the minimum of the population cost function

$$\theta^* = \arg\min_{\theta \in \Omega} \mathcal{L}(\theta). \tag{9.2}$$

In many settings (in particular, when $\mathcal{L}_n$ is the negative log-likelihood of the data) this
minimum is achieved at an interior point of $\Omega$, in which case $\theta^*$ must satisfy the zero-gradient
equation $\nabla \mathcal{L}(\theta^*) = 0$. However, we do not assume this condition in our general analysis.

With this set-up, our goal is to estimate $\theta^*$ on the basis of the observed samples
$Z_1^n = \{Z_1, \ldots, Z_n\}$. In order to do so, we combine the empirical cost function with a regularizer or
penalty function $\Phi : \Omega \to \mathbb{R}$. As will be clarified momentarily, the purpose of this regularizer
is to enforce a certain type of structure expected in $\theta^*$. Our overall estimator is based on
solving the optimization problem

$$\widehat{\theta} \in \arg\min_{\theta \in \Omega} \big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \Phi(\theta) \big\}, \tag{9.3}$$

where $\lambda_n > 0$ is a user-defined regularization weight. The estimator (9.3) is known as an
M-estimator, where the "M" stands for minimization (or maximization).


Remark: An important remark on notation is needed before proceeding. From here
onwards, we will frequently adopt $\mathcal{L}_n(\theta)$ as a shorthand for $\mathcal{L}_n(\theta; Z_1^n)$, remembering that the
subscript $n$ reflects implicitly the dependence on the underlying samples. We also adopt the
same notation for the derivatives of the empirical cost function.


Let us illustrate this set-up with some examples. 


Example 9.1 (Linear regression and Lasso) We begin with the problem of linear regression
previously studied in Chapter 7. In this case, each sample takes the form $Z_i = (x_i, y_i)$, where
$x_i \in \mathbb{R}^d$ is a covariate vector, and $y_i \in \mathbb{R}$ is a response variable. In the simplest case, we
assume that the data are generated exactly from a linear model, so that $y_i = \langle x_i, \theta^* \rangle + w_i$,
where $w_i$ is some type of stochastic noise variable, assumed to be independent of $x_i$. The
least-squares estimator is based on the quadratic cost function

$$\mathcal{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \big(y_i - \langle x_i, \theta \rangle\big)^2 = \frac{1}{2n} \|y - X\theta\|_2^2,$$

where we recall from Chapter 7 our usual notation for the vector $y \in \mathbb{R}^n$ of response variables
and design matrix $X \in \mathbb{R}^{n \times d}$. When the response-covariate pairs $(y_i, x_i)$ are drawn from a
linear model with regression vector $\theta^*$, then the population cost function takes the form

$$\mathbb{E}_{x,y}\Big[\frac{1}{2}\big(y - \langle x, \theta \rangle\big)^2\Big] = \frac{1}{2}(\theta - \theta^*)^T \Sigma (\theta - \theta^*) + \frac{1}{2}\sigma^2 = \frac{1}{2}\big\|\sqrt{\Sigma}\,(\theta - \theta^*)\big\|_2^2 + \frac{1}{2}\sigma^2,$$

where $\Sigma := \mathrm{cov}(x_1)$ and $\sigma^2 := \mathrm{var}(w_1)$. Even when the samples are not drawn from a
linear model, we can still define $\theta^*$ as a minimizer of the population cost function
$\theta \mapsto \mathbb{E}_{x,y}\big[(y - \langle x, \theta \rangle)^2\big]$. In this case, the linear function $x \mapsto \langle x, \theta^* \rangle$ provides the best linear
approximation of the regression function $x \mapsto \mathbb{E}[y \mid x]$.

As discussed in Chapter 7, there are many cases in which the target regression vector $\theta^*$
is expected to be sparse, and in such settings, a good choice of regularizer $\Phi$ is the $\ell_1$-norm
$\Phi(\theta) = \sum_{j=1}^d |\theta_j|$. In conjunction with the least-squares loss, we obtain the Lasso estimator

$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\big(y_i - \langle x_i, \theta \rangle\big)^2 + \lambda_n \sum_{j=1}^d |\theta_j| \Big\} \tag{9.4}$$

as a special case of the general estimator (9.3). See Chapter 7 for an in-depth analysis of this
particular M-estimator. ♣
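To make the objective (9.4) concrete, here is a minimal numerical sketch. The text does not prescribe an algorithm; the solver below is proximal gradient descent (ISTA), chosen here only as one standard option. The function names, the synthetic data, and the value of `lam` are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(theta, X, y, lam):
    """Value of the Lasso objective (9.4) at theta."""
    n = X.shape[0]
    return 0.5 / n * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

def lasso_ista(X, y, lam, num_iters=2000):
    """Proximal gradient descent on (9.4), started from zero (an assumed choice of solver)."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # inverse Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = -X.T @ (y - X @ theta) / n     # gradient of the least-squares cost
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Synthetic sparse regression problem (all values are illustrative).
rng = np.random.default_rng(0)
n, d = 200, 20
theta_star = np.zeros(d)
theta_star[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)
lam = 0.1
theta_hat = lasso_ista(X, y, lam)
```

Note that `theta_hat` attains a smaller objective value than `theta_star` itself: the estimator minimizes the regularized empirical cost, not the distance to the truth, which is why a separate error analysis is needed.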


As our first extension of the basic Lasso (9.4), we now consider a more general family of 
regression problems. 


Figure 9.1 Illustration of unit balls of different norms in $\mathbb{R}^3$. (a) The $\ell_1$-ball generated
by $\Phi(\theta) = \sum_{j=1}^3 |\theta_j|$. (b) The group Lasso ball generated by $\Phi(\theta) = \sqrt{\theta_1^2 + \theta_2^2} + |\theta_3|$.
(c) A group Lasso ball with overlapping groups, generated by
$\Phi(\theta) = \sqrt{\theta_1^2 + \theta_2^2} + \sqrt{\theta_1^2 + \theta_3^2}$.


Example 9.2 (Generalized linear models and $\ell_1$-regularization) We again consider samples
of the form $Z_i = (x_i, y_i)$ where $x_i \in \mathbb{R}^d$ is a vector of covariates, but now the response variable
$y_i$ is allowed to take values in an arbitrary space $\mathcal{Y}$. The previous example of linear regression
corresponds to the case $\mathcal{Y} = \mathbb{R}$. A different example is the problem of binary classification,
in which the response $y_i$ represents a class label belonging to $\mathcal{Y} = \{0, 1\}$. For applications
that involve responses that take on non-negative integer values (for instance, photon counts
in imaging applications) the choice $\mathcal{Y} = \{0, 1, 2, \ldots\}$ is appropriate.

The family of generalized linear models, or GLMs for short, provides a unified approach
to these different types of regression problems. Any GLM is based on modeling the
conditional distribution of the response $y \in \mathcal{Y}$ given the covariate $x \in \mathbb{R}^d$ in an exponential family
form, namely as

$$\mathbb{P}_{\theta^*}(y \mid x) = h_\sigma(y) \exp\Big\{ \frac{y \langle x, \theta^* \rangle - \psi(\langle x, \theta^* \rangle)}{c(\sigma)} \Big\}, \tag{9.5}$$

where $c(\sigma)$ is a scale parameter, and the function $\psi : \mathbb{R} \to \mathbb{R}$ is the partition function of the
underlying exponential family.

Many standard models are special cases of the generalized linear family (9.5). First,
consider the standard linear model $y = \langle x, \theta^* \rangle + w$, where $w \sim N(0, \sigma^2)$. Setting $c(\sigma) = \sigma^2$
and $\psi(t) = t^2/2$, the conditional distribution (9.5) corresponds to that of a $N(\langle x, \theta^* \rangle, \sigma^2)$
variate, as required. Similarly, in the logistic model for binary classification, we assume that
the log-odds ratio is given by $\langle x, \theta^* \rangle$, that is,

$$\log \frac{\mathbb{P}_{\theta^*}(y = 1 \mid x)}{\mathbb{P}_{\theta^*}(y = 0 \mid x)} = \langle x, \theta^* \rangle. \tag{9.6}$$

This assumption again leads to a special case of the generalized linear model (9.5), this
time with $c(\sigma) \equiv 1$ and $\psi(t) = \log(1 + \exp(t))$. As a final example, when the response
$y \in \{0, 1, 2, \ldots\}$ represents some type of count, it can be appropriate to model $y$ as conditionally
Poisson with mean $\mu = e^{\langle x, \theta^* \rangle}$. This assumption leads to a generalized linear model (9.5)
with $\psi(t) = \exp(t)$ and $c(\sigma) \equiv 1$. See Exercise 9.3 for verification of these properties.

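Given $n$ samples from the model (9.5), the negative log-likelihood takes the form

$$\mathcal{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \psi(\langle x_i, \theta \rangle) - \Big\langle \frac{1}{n} \sum_{i=1}^n y_i x_i, \; \theta \Big\rangle. \tag{9.7}$$

Here we have rescaled the log-likelihood by $1/n$ for later convenience, and also dropped the
scale factor $c(\sigma)$, since it is independent of $\theta$. When the true regression vector $\theta^*$ is expected
to be sparse, then it is again reasonable to use the $\ell_1$-norm as a regularizer, and combining
with the cost function (9.7) leads to the generalized linear Lasso

$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^n \psi(\langle x_i, \theta \rangle) - \Big\langle \frac{1}{n} \sum_{i=1}^n y_i x_i, \; \theta \Big\rangle + \lambda_n \|\theta\|_1 \Big\}. \tag{9.8}$$

When $\psi(t) = t^2/2$, this objective function is equivalent to the standard Lasso, apart from the
constant term $\frac{1}{2n} \sum_{i=1}^n y_i^2$ that has no effect on $\widehat{\theta}$. ♣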
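The cost (9.7) depends on the model only through the function $\psi$, which the following minimal sketch makes explicit. It also checks numerically the closing remark that the choice $\psi(t) = t^2/2$ recovers least squares up to the constant $\frac{1}{2n}\sum_i y_i^2$. The dictionary `PSI`, the function names, and the synthetic data are illustrative assumptions.

```python
import numpy as np

# The three partition functions named in the text.
PSI = {
    "linear": lambda t: 0.5 * t ** 2,
    "logistic": lambda t: np.logaddexp(0.0, t),   # log(1 + exp(t)), computed stably
    "poisson": np.exp,
}

def glm_loss(theta, X, y, psi):
    """Rescaled negative log-likelihood (9.7), with the scale factor dropped."""
    n = X.shape[0]
    return np.mean(psi(X @ theta)) - (X.T @ y / n) @ theta

def glm_lasso_objective(theta, X, y, psi, lam):
    """Objective of the generalized linear Lasso (9.8)."""
    return glm_loss(theta, X, y, psi) + lam * np.sum(np.abs(theta))

# Illustrative data for the check.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
theta = rng.standard_normal(4)
y = X @ theta + rng.standard_normal(50)

least_squares = 0.5 * np.mean((y - X @ theta) ** 2)
glm_value = glm_loss(theta, X, y, PSI["linear"])
constant = 0.5 * np.mean(y ** 2)   # the term that does not depend on theta
```

With these definitions, `glm_value + constant` agrees with `least_squares` for any `theta`.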


Thus far, we have discussed only the $\ell_1$-norm. There are various extensions of the $\ell_1$-norm
that are based on some type of grouping of the coefficients.


Example 9.3 (Group Lasso) Let $\mathcal{G} = \{g_1, \ldots, g_T\}$ be a disjoint partition of the index set
$\{1, \ldots, d\}$, that is, each group $g_j$ is a subset of the index set, disjoint from every other group,
and the union of all $T$ groups covers the full index set. See panel (a) in Figure 9.3 for an
example of a collection of non-overlapping groups.

For a given vector $\theta \in \mathbb{R}^d$, we let $\theta_g$ denote the $d$-dimensional vector with components
equal to $\theta$ on indices within $g$, and zero in all other positions. For a given base norm $\|\cdot\|$, we
then define the group Lasso norm

$$\Phi(\theta) := \sum_{g \in \mathcal{G}} \|\theta_g\|. \tag{9.9}$$

The standard form of the group Lasso uses the $\ell_2$-norm as the base norm, so that we
obtain a block $\ell_1/\ell_2$-norm, namely the $\ell_1$-norm of the $\ell_2$-norms within each group. See
Figure 9.1(b) for an illustration of the norm (9.9) with the blocks $g_1 = \{1, 2\}$ and $g_2 = \{3\}$. The
block $\ell_1/\ell_\infty$-version of the group Lasso has also been studied extensively. Apart from the
basic group Lasso (9.9), another variant involves associating a positive weight $\omega_g$ with each
group. ♣
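The group norm (9.9) is easy to evaluate directly, as the following minimal sketch shows. The function name and the 0-based index convention are assumptions of the sketch (the text indexes coordinates from 1); the groups mirror the blocks $g_1 = \{1, 2\}$ and $g_2 = \{3\}$ of Figure 9.1(b).

```python
import numpy as np

def group_lasso_norm(theta, groups, base_ord=2):
    """Group norm (9.9): sum over groups of the base norm of each block.

    `groups` is a list of 0-based index tuples; `base_ord` selects the base norm
    (2 for the block l1/l2 norm, np.inf for the block l1/l_infinity norm).
    """
    theta = np.asarray(theta, dtype=float)
    return sum(np.linalg.norm(theta[list(g)], ord=base_ord) for g in groups)

theta = np.array([3.0, 4.0, -2.0])
groups = [(0, 1), (2,)]                  # g1 = {1, 2}, g2 = {3} in the text's numbering

block_l1_l2 = group_lasso_norm(theta, groups)                       # 5 + 2
block_l1_linf = group_lasso_norm(theta, groups, base_ord=np.inf)    # 4 + 2
plain_l1 = group_lasso_norm(theta, [(0,), (1,), (2,)])              # singleton groups
```

With singleton groups the penalty reduces to the ordinary $\ell_1$-norm, which is the sense in which (9.9) extends it.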


In the preceding example, the groups were non-overlapping. The same regularizer (9.9)
can also be used in the case of overlapping groups; it remains a norm as long as the groups
cover the space. For instance, Figure 9.1(c) shows the unit ball generated by the
overlapping groups $g_1 = \{1, 2\}$ and $g_2 = \{1, 3\}$ in $\mathbb{R}^3$. However, the standard group Lasso (9.9)
with overlapping groups has a property that can be undesirable. Recall that the motivation
for group-structured penalties is to estimate parameter vectors whose support lies within a
union of a (relatively small) subset of groups. However, when used as a regularizer in an
M-estimator, the standard group Lasso (9.9) with overlapping groups typically leads to
solutions with support contained in the complement of a union of groups. For instance, in the
example shown in Figure 9.1(c) with groups $g_1 = \{1, 2\}$ and $g_2 = \{1, 3\}$, apart from the
all-zero solution that has empty support set, or a solution with the complete support $\{1, 2, 3\}$,
the penalty encourages solutions with supports equal to either $g_1^c = \{3\}$ or $g_2^c = \{2\}$.


[Figure 9.2, panel (a): plot titled "Standard vs overlap group norms"; horizontal axis: value
of $\theta_3$, from $-1.0$ to $1.0$; vertical axis: regularizer value; curves: Standard, Overlap.]

Figure 9.2 (a) Plots of the residual penalty $f(\theta_3) = \Phi(1, 1, \theta_3) - \Phi(1, 1, 0)$ for the
standard group Lasso (9.9) with a solid line and overlap group Lasso (9.10) with a
dashed line, in the case of the groups $g_1 = \{1, 2\}$ and $g_2 = \{1, 3\}$. (b) Plot of the unit
ball of the overlapping group Lasso norm (9.10) for the same groups as in panel (a).


Why is this the case? In the example given above, consider a vector $\theta \in \mathbb{R}^3$ such that
$\theta_1$, a variable shared by both groups, is active. For concreteness, say that $\theta_1 = \theta_2 = 1$, and
consider the residual penalty $f(\theta_3) := \Phi(1, 1, \theta_3) - \Phi(1, 1, 0)$ on the third coefficient. It takes
the form

$$f(\theta_3) = \|(1, 1)\|_2 + \|(1, \theta_3)\|_2 - \|(1, 1)\|_2 - \|(1, 0)\|_2 = \sqrt{1 + \theta_3^2} - 1.$$

As shown by the solid curve in Figure 9.2(a), the function $f$ is differentiable at $\theta_3 = 0$.
Indeed, since $f'(\theta_3)\big|_{\theta_3 = 0} = 0$, this penalty does not encourage sparsity of the third coefficient.
Figure 9.3 (a) Group Lasso penalty with non-overlapping groups. The groups
$\{g_1, g_2, g_3\}$ form a disjoint partition of the index set $\{1, 2, \ldots, d\}$. (b) A total of $d = 7$
variables are associated with the vertices of a binary tree, and sub-trees are used to
define a set of overlapping groups. Such overlapping group structures arise naturally
in multiscale signal analysis.


A similar argument applies with the roles of $\theta_2$ and $\theta_3$ reversed. Consequently, if the shared
first variable is active in an optimal solution, it is usually the case that the second and third
variables will also be active, leading to a fully dense solution. See the bibliographic
discussion for references that discuss this phenomenon in greater detail.


The overlapping group Lasso is a closely related but different penalty that is designed to 
overcome this potentially troublesome issue. 


Example 9.4 (Overlapping group Lasso) As in Example 9.3, consider a collection of
groups $\mathcal{G} = \{g_1, \ldots, g_T\}$, where each group is a subset of the index set $\{1, \ldots, d\}$. We
require that the union over all groups covers the full index set, but we allow for overlaps
among the groups. See panel (b) in Figure 9.3 for an example of a collection of overlapping
groups.

When there actually is overlap, any vector $\theta$ has many possible group representations,
meaning collections $\{w_g, g \in \mathcal{G}\}$, with each vector $w_g$ supported on the group $g$, such that
$\sum_{g \in \mathcal{G}} w_g = \theta$. The overlap group norm is based on minimizing over all such
representations, as follows:

$$\Phi_{\mathrm{over}}(\theta) := \inf_{\substack{\theta = \sum_{g \in \mathcal{G}} w_g \\ w_g, \; g \in \mathcal{G}}} \Big\{ \sum_{g \in \mathcal{G}} \|w_g\| \Big\}. \tag{9.10}$$

As we verify in Exercise 9.1, the variational representation (9.10) defines a valid norm on
$\mathbb{R}^d$. Of course, when the groups are non-overlapping, this definition reduces to the previous
one (9.9). Figure 9.2(b) shows the overlapping group norm (9.10) in the special case of the
groups $g_1 = \{1, 2\}$ and $g_2 = \{1, 3\}$. Notice how it differs from the standard group Lasso (9.9)
with the same choice of groups, as shown in Figure 9.1(c). ♣


When used as a regularizer in the general M-estimator (9.3), the overlapping group Lasso
(9.10) tends to induce solution vectors with their support contained within a union of the
groups. To understand this issue, let us return to the group set $g_1 = \{1, 2\}$ and $g_2 = \{1, 3\}$,
and suppose once again that the first two variables are active, say $\theta_1 = \theta_2 = 1$. The residual
penalty on $\theta_3$ then takes the form

$$f_{\mathrm{over}}(\theta_3) := \Phi_{\mathrm{over}}(1, 1, \theta_3) - \Phi_{\mathrm{over}}(1, 1, 0) = \inf_{\alpha \in \mathbb{R}} \big\{ \|(\alpha, 1)\|_2 + \|(1 - \alpha, \theta_3)\|_2 \big\} - \sqrt{2}.$$

It can be shown that this function behaves like the $\ell_1$-norm around the origin, so that it tends
to encourage sparsity in $\theta_3$. See the dashed curve in Figure 9.2(a) for an illustration.
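The contrast between the two residual penalties can be checked numerically. The sketch below evaluates the standard residual penalty $\sqrt{1 + \theta_3^2} - 1$ in closed form and the overlap version by brute-force minimization over $\alpha$ on a grid. The grid range, grid size, and function names are assumptions of the sketch. As a side remark not taken from the text, the infimum over $\alpha$ is the length of the straight segment from $(0, 1)$ to $(1, -|\theta_3|)$, which gives $f_{\mathrm{over}}(\theta_3) = \sqrt{1 + (1 + |\theta_3|)^2} - \sqrt{2}$ and hence a slope of $1/\sqrt{2}$ at the origin.

```python
import numpy as np

def residual_standard(t):
    """Residual penalty of the standard group Lasso: sqrt(1 + t^2) - 1."""
    return np.sqrt(1.0 + t ** 2) - 1.0

def residual_overlap(t, num_grid=300001):
    """Residual penalty of the overlap norm, minimizing over alpha on a grid."""
    alphas = np.linspace(-1.0, 2.0, num_grid)
    values = np.sqrt(alphas ** 2 + 1.0) + np.sqrt((1.0 - alphas) ** 2 + t ** 2)
    return values.min() - np.sqrt(2.0)

# Growth rate of each penalty just to the right of the origin.
eps = 1e-2
slope_standard = residual_standard(eps) / eps   # tends to 0: no kink at the origin
slope_overlap = residual_overlap(eps) / eps     # stays near 1/sqrt(2): l1-like kink
```

A penalty whose one-sided slope at zero is strictly positive can hold a coefficient at exactly zero, which is the mechanism behind the sparsity claim above.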


Up to this point, we have considered vector estimation problems, in which the parameter
space $\Omega$ is some subspace of $\mathbb{R}^d$. We now turn to various types of matrix estimation problems,
in which the parameter space is some subset of $\mathbb{R}^{d_1 \times d_2}$, the space of all $(d_1 \times d_2)$-dimensional
matrices. Of course, any such problem can be viewed as a vector estimation problem, simply
by transforming the matrix to a $D = d_1 d_2$ vector. However, it is often more natural to retain
the matrix structure of the problem. Let us consider some examples.


Example 9.5 (Estimation of Gaussian graphical models) Any zero-mean Gaussian random
vector with a strictly positive definite covariance matrix $\Sigma \succ 0$ has a density of the form

$$\mathbb{P}(x_1, \ldots, x_d; \Theta^*) \propto \sqrt{\det(\Theta^*)} \; e^{-\frac{1}{2} x^T \Theta^* x}, \tag{9.11}$$

where $\Theta^* = \Sigma^{-1}$ is the inverse covariance matrix, also known as the precision matrix. In
many cases, the components of the random vector $X = (X_1, \ldots, X_d)$ satisfy various types of
conditional independence relationships: for instance, it might be the case that $X_j$ is
conditionally independent of $X_k$ given the other variables $X_{\setminus\{j,k\}}$. In the Gaussian case, it is a
consequence of the Hammersley-Clifford theorem that this conditional independence statement
holds if and only if the precision matrix $\Theta^*$ has a zero in position $(j, k)$. Thus, conditional
independence is directly captured by the sparsity of the precision matrix. See Chapter 11
for further details on this relationship between conditional independence and the structure
of $\Theta^*$.

Given a Gaussian model that satisfies many conditional independence relationships, the
precision matrix will be sparse, in which case it is natural to use the elementwise $\ell_1$-norm
$\Phi(\Theta) = \sum_{j \neq k} |\Theta_{jk}|$ as a regularizer. Here we have chosen not to regularize the diagonal
entries, since they all must be non-zero so as to ensure strict positive definiteness. Combining
this form of $\ell_1$-regularization with the Gaussian log-likelihood leads to the estimator

$$\widehat{\Theta} \in \arg\min_{\Theta \in \mathcal{S}^{d \times d}} \Big\{ \langle\!\langle \Theta, \widehat{\Sigma} \rangle\!\rangle - \log\det \Theta + \lambda_n \sum_{j \neq k} |\Theta_{jk}| \Big\}, \tag{9.12}$$

where $\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$ is the sample covariance matrix. This combination corresponds to
another special case of the general estimator (9.3), known as the graphical Lasso, which we
analyze in Chapter 11. ♣
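As a minimal sketch of how the objective in (9.12) is evaluated, the function below computes the trace inner product, the log-determinant, and the off-diagonal $\ell_1$ penalty. It only evaluates the objective and is not a solver. The function name, the convention of returning infinity outside the positive definite cone, and the synthetic data are assumptions of the sketch.

```python
import numpy as np

def graphical_lasso_objective(Theta, Sigma_hat, lam):
    """Objective of (9.12); returns +inf if Theta is not strictly positive definite."""
    if np.linalg.eigvalsh(Theta).min() <= 0:
        return np.inf
    _, logdet = np.linalg.slogdet(Theta)
    off_diag_l1 = np.sum(np.abs(Theta)) - np.sum(np.abs(np.diag(Theta)))
    # For symmetric matrices, trace(Theta @ Sigma_hat) is the trace inner product.
    return np.trace(Theta @ Sigma_hat) - logdet + lam * off_diag_l1

# Illustrative zero-mean samples and their sample covariance.
rng = np.random.default_rng(2)
samples = rng.standard_normal((50, 4))
Sigma_hat = samples.T @ samples / samples.shape[0]

# Without regularization (lam = 0), the minimizer is the inverse sample covariance.
Theta_mle = np.linalg.inv(Sigma_hat)
value_at_mle = graphical_lasso_objective(Theta_mle, Sigma_hat, lam=0.0)
value_at_identity = graphical_lasso_objective(np.eye(4), Sigma_hat, lam=0.0)
```

At `Theta_mle` the trace term equals the dimension, so the unregularized objective there is $d + \log\det \widehat{\Sigma}$.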


The problem of multivariate regression is a natural extension of a standard regression 
problem, which involves scalar response variables, to the vector-valued setting. 


Example 9.6 (Multivariate regression) In a multivariate regression problem, we observe
samples of the form $(z_i, y_i) \in \mathbb{R}^p \times \mathbb{R}^T$, and our goal is to use the vector of features $z_i$ to
predict the vector of responses $y_i \in \mathbb{R}^T$. Let $Y \in \mathbb{R}^{n \times T}$ and $Z \in \mathbb{R}^{n \times p}$ be matrices with $y_i$ and
$z_i$, respectively, as their $i$th row. In the simplest case, we assume that the response matrix $Y$
and covariate matrix $Z$ are linked via the linear model

$$Y = Z\Theta^* + W, \tag{9.13}$$

where $\Theta^* \in \mathbb{R}^{p \times T}$ is a matrix of regression coefficients, and $W \in \mathbb{R}^{n \times T}$ is a stochastic noise
matrix. See Figure 9.4 for an illustration.


Figure 9.4 Illustration of the multivariate linear regression model: a data set of $n$
observations consists of a matrix $Y \in \mathbb{R}^{n \times T}$ of multivariate responses, and a matrix
$Z \in \mathbb{R}^{n \times p}$ of covariates, in this case shared across the tasks. Our goal is to estimate
the matrix $\Theta^* \in \mathbb{R}^{p \times T}$ of regression coefficients.


One way in which to view the model (9.13) is as a collection of $T$ different $p$-dimensional
regression problems of the form

$$Y_{\cdot,t} = Z\Theta^*_{\cdot,t} + W_{\cdot,t}, \quad \text{for } t = 1, \ldots, T,$$

where $Y_{\cdot,t} \in \mathbb{R}^n$, $\Theta^*_{\cdot,t} \in \mathbb{R}^p$ and $W_{\cdot,t} \in \mathbb{R}^n$ are the $t$th columns of the matrices $Y$, $\Theta^*$ and
$W$, respectively. One could then estimate each column $\Theta^*_{\cdot,t}$ separately by solving a standard
univariate regression problem.

However, many applications lead to interactions between the different columns of $\Theta^*$,
which motivates solving the univariate regression problems in a joint manner. For instance,
it is often the case that there is a subset of features (that is, a subset of the rows of $\Theta^*$)
that are relevant for prediction in all $T$ regression problems. For estimating such a
row-sparse matrix, a natural regularizer is the row-wise $(2, 1)$-norm $\Phi(\Theta) := \sum_{j=1}^p \|\Theta_{j,\cdot}\|_2$, where
$\Theta_{j,\cdot} \in \mathbb{R}^T$ denotes the $j$th row of the matrix $\Theta \in \mathbb{R}^{p \times T}$. Note that this regularizer is a special
case of the general group penalty (9.9). Combining this regularizer with the least-squares
cost, we obtain

$$\widehat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{p \times T}} \Big\{ \frac{1}{2n} |\!|\!|Y - Z\Theta|\!|\!|_F^2 + \lambda_n \sum_{j=1}^p \|\Theta_{j,\cdot}\|_2 \Big\}. \tag{9.14}$$

This estimator is often referred to as the multivariate group Lasso, for obvious reasons. The
underlying optimization problem is an instance of a second-order cone problem (SOCP),
and can be solved efficiently by a variety of algorithms; see the bibliography section for
further discussion. ♣
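For concreteness, the row-wise $(2, 1)$-norm and the objective in (9.14) can be evaluated as in the following minimal sketch. It computes values only and does not solve the SOCP. The function names and the small matrices are illustrative assumptions.

```python
import numpy as np

def row_norm_21(Theta):
    """Row-wise (2,1)-norm: the sum of the Euclidean norms of the rows."""
    return np.sum(np.linalg.norm(Theta, axis=1))

def multivariate_group_lasso_objective(Theta, Y, Z, lam):
    """Objective of (9.14): scaled squared Frobenius error plus the (2,1)-norm penalty."""
    n = Y.shape[0]
    residual = Y - Z @ Theta
    return 0.5 / n * np.linalg.norm(residual, "fro") ** 2 + lam * row_norm_21(Theta)

# A row-sparse coefficient matrix with p = 3 features and T = 2 tasks:
# the second feature is inactive in every task.
Theta = np.array([[3.0, 4.0],
                  [0.0, 0.0],
                  [0.0, -5.0]])
penalty = row_norm_21(Theta)             # 5 + 0 + 5

rng = np.random.default_rng(3)
Z = rng.standard_normal((30, 3))
Y = Z @ Theta                            # noiseless responses, for illustration only
objective_at_truth = multivariate_group_lasso_objective(Theta, Y, Z, lam=0.1)
```

Because the responses here are noiseless, the objective at `Theta` consists of the penalty term alone.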


Other types of structure are also possible in multivariate regression problems, and lead to 
different types of regularization. 


Example 9.7 (Overlapping group Lasso and multivariate regression) There is an
interesting extension of the row-sparse model from Example 9.6, one which leads to an instance of
the overlapping group Lasso (9.10). The row-sparse model assumes that there is a relatively
small subset of predictors, each of which is active in all of the $T$ tasks. A more flexible
model allows for the possibility of a subset of predictors that are shared among all tasks,
coupled with a subset of predictors that appear in only one (or relatively few) tasks. This
type of structure can be modeled by decomposing the regression matrix $\Theta^*$ as the sum of
a row-sparse matrix $\Omega^*$ along with an elementwise-sparse matrix $\Gamma^*$. If we impose a group
$\ell_{1,2}$-norm on the row-sparse component and an ordinary $\ell_1$-norm on the element-sparse
component, then we are led to the estimator

$$(\widehat{\Omega}, \widehat{\Gamma}) \in \arg\min_{\Omega, \Gamma \in \mathbb{R}^{p \times T}} \Big\{ \frac{1}{2n} |\!|\!|Y - Z(\Omega + \Gamma)|\!|\!|_F^2 + \lambda_n \sum_{j=1}^p \|\Omega_{j,\cdot}\|_2 + \mu_n \|\Gamma\|_1 \Big\}, \tag{9.15}$$

where $\lambda_n, \mu_n > 0$ are regularization parameters to be chosen. Any solution to this
optimization problem defines an estimate of the full regression matrix via $\widehat{\Theta} = \widehat{\Omega} + \widehat{\Gamma}$.

We have defined the estimator (9.15) as an optimization problem over the matrix pair
$(\Omega, \Gamma)$, using a separate regularizer for each matrix component. Alternatively, we can
formulate it as a direct estimator for $\Theta$. In particular, by making the substitution $\Theta = \Omega + \Gamma$,
and minimizing over both $\Theta$ and the pair $(\Omega, \Gamma)$ subject to this linear constraint, we obtain
the equivalent formulation

$$\widehat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{p \times T}} \Big\{ \frac{1}{2n} |\!|\!|Y - Z\Theta|\!|\!|_F^2 + \lambda_n \underbrace{\inf_{\Omega + \Gamma = \Theta} \big\{ \|\Omega\|_{1,2} + \omega_n \|\Gamma\|_1 \big\}}_{\Phi_{\mathrm{over}}(\Theta)} \Big\}, \tag{9.16}$$

where $\|\Omega\|_{1,2} := \sum_{j=1}^p \|\Omega_{j,\cdot}\|_2$ and $\omega_n = \mu_n / \lambda_n$. In this direct formulation, we see that the
assumed decomposition leads to an interesting form of the overlapping group norm. We return
to study the estimator (9.16) in Section 9.7. ♣


In other applications of multivariate regression, one might imagine that the individual
regression vectors (that is, the columns $\Theta^*_{\cdot,t} \in \mathbb{R}^p$) all lie within some low-dimensional
subspace, corresponding to some hidden meta-features, so that the matrix $\Theta^*$ has relatively low rank.
Many other problems, to be discussed in more detail in Chapter 10, also lead to estimation
problems that involve rank constraints. In such settings, the ideal approach would be to
impose an explicit rank constraint within our estimation procedure. Unfortunately, when
viewed as a function on the space of $d_1 \times d_2$ matrices, the rank function is non-convex, so
that this approach is not computationally feasible. Accordingly, we are motivated to study
convex relaxations of rank constraints.


Example 9.8 (Nuclear norm as a relaxation of rank) The nuclear norm provides a natural
relaxation of the rank of a matrix, one which is analogous to the $\ell_1$-norm as a relaxation of
the cardinality of a vector. In order to define the nuclear norm, we first recall the singular
value decomposition, or SVD for short, of a matrix $\Theta \in \mathbb{R}^{d_1 \times d_2}$. Letting $d' = \min\{d_1, d_2\}$, the
SVD takes the form

$$\Theta = U D V^T, \tag{9.17}$$

where $U \in \mathbb{R}^{d_1 \times d'}$ and $V \in \mathbb{R}^{d_2 \times d'}$ are orthonormal matrices (meaning that $U^T U = V^T V = I_{d'}$).
The matrix $D \in \mathbb{R}^{d' \times d'}$ is diagonal with its entries corresponding to the singular values of $\Theta$,
denoted by

$$\sigma_1(\Theta) \ge \sigma_2(\Theta) \ge \sigma_3(\Theta) \ge \cdots \ge \sigma_{d'}(\Theta) \ge 0. \tag{9.18}$$

Figure 9.5 Illustration of the nuclear norm ball as a relaxation of a rank constraint.
(a) Set of all matrices of the form $\Theta = \begin{bmatrix} \theta_1 & \theta_2 \\ \theta_2 & \theta_3 \end{bmatrix}$ such that $|\!|\!|\Theta|\!|\!|_{\mathrm{nuc}} \le 1$. This is
a projection of the unit ball of the nuclear norm ball onto the space of symmetric
matrices. (b) For a parameter $q > 0$, the $\ell_q$-"ball" of matrices is defined by
$\mathbb{B}_q(1) = \{\Theta \in \mathbb{R}^{2 \times 2} \mid \sum_{j=1}^2 \sigma_j(\Theta)^q \le 1\}$. For all $q \in [0, 1)$, this is a non-convex set, and it is
equivalent to the set of all rank-one matrices for $q = 0$.

Observe that the number of strictly positive singular values specifies the rank, that is, we
have $\mathrm{rank}(\Theta) = \sum_{j=1}^{d'} \mathbb{I}[\sigma_j(\Theta) > 0]$. This observation, though not practically useful on its
own, suggests a natural convex relaxation of a rank constraint, namely the nuclear norm

$$|\!|\!|\Theta|\!|\!|_{\mathrm{nuc}} = \sum_{j=1}^{d'} \sigma_j(\Theta), \tag{9.19}$$

corresponding to the $\ell_1$-norm of the singular values.¹ As shown in Figure 9.5(a), the nuclear
norm provides a convex relaxation of the set of low-rank matrices. ♣
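The relationship between the singular values, the rank, and the nuclear norm (9.19) can be seen on a small example. The following is a minimal sketch; the function names, the tolerance used to decide which singular values count as positive, and the test matrix are assumptions.

```python
import numpy as np

def singular_values(Theta):
    """Singular values of Theta, in the decreasing order of (9.18)."""
    return np.linalg.svd(Theta, compute_uv=False)

def nuclear_norm(Theta):
    """Nuclear norm (9.19): the l1-norm of the singular values."""
    return singular_values(Theta).sum()

def numerical_rank(Theta, tol=1e-10):
    """Number of singular values above a small tolerance (a numerical stand-in for > 0)."""
    return int(np.sum(singular_values(Theta) > tol))

# A rank-one 2 x 3 matrix u v^T; its only non-zero singular value is ||u||_2 * ||v||_2.
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0, 0.0])
Theta = np.outer(u, v)

rank_of_theta = numerical_rank(Theta)
nuc_of_theta = nuclear_norm(Theta)       # sqrt(5) * 5 for this matrix
```

NumPy also exposes the same quantity directly as `np.linalg.norm(Theta, "nuc")`, which is a convenient cross-check.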


There are a variety of other statistical models—in addition to multivariate regression—in 
which rank constraints play a role, and the nuclear norm relaxation is useful for many of 
them. These problems are discussed in detail in Chapter 10 to follow. 


¹ No absolute value is necessary, since singular values are non-negative by definition.
9.2 Decomposable regularizers and their utility


Having considered a general family of M-estimators (9.3) and illustrated it with various
examples, we now turn to the development of techniques for bounding the estimation error
$\widehat{\theta} - \theta^*$. The first ingredient in our analysis is a property of the regularizer known as
decomposability. It is a geometric property, based on how the regularizer behaves over certain pairs
of subspaces. The $\ell_1$-norm is the canonical example of a decomposable norm, but various
other norms also share this property. Decomposability implies that any optimum $\widehat{\theta}$ to the
M-estimator (9.3) belongs to a very special set, as shown in Proposition 9.13.

From here onwards, we assume that the set $\Omega$ is endowed with an inner product $\langle \cdot, \cdot \rangle$, and
we use $\|\cdot\|$ to denote the norm induced by this inner product. The standard examples to keep
in mind are

• the space $\mathbb{R}^d$ with the usual Euclidean inner product, or more generally with a weighted
Euclidean inner product, and

• the space $\mathbb{R}^{d_1 \times d_2}$ equipped with the trace inner product (10.1).

Given a vector $\theta \in \Omega$ and a subspace $\mathbb{S}$ of $\Omega$, we use $\theta_{\mathbb{S}}$ to denote the projection of $\theta$ onto $\mathbb{S}$.
More precisely, we have

$$\theta_{\mathbb{S}} := \arg\min_{\widetilde{\theta} \in \mathbb{S}} \|\widetilde{\theta} - \theta\|. \tag{9.20}$$
BES 


These projections play an important role in the sequel; see Exercise 9.2 for some examples. 
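The projection (9.20) has a simple form in the two cases sketched below, both under the Euclidean inner product on $\mathbb{R}^d$ (an assumption; a weighted inner product would change the formulas). The function names and the 0-based indices are also assumptions. For a subspace spanned by the columns of a matrix `B`, the projection solves a least-squares problem; for the subspace of vectors supported on an index set, it simply zeroes out the remaining coordinates.

```python
import numpy as np

def project_onto_span(theta, B):
    """Euclidean projection of theta onto the column span of B, via least squares."""
    coeffs, *_ = np.linalg.lstsq(B, theta, rcond=None)
    return B @ coeffs

def project_onto_support(theta, support):
    """Euclidean projection onto the vectors supported on `support` (0-based indices)."""
    projected = np.zeros_like(theta)
    projected[list(support)] = theta[list(support)]
    return projected

theta = np.array([1.0, -2.0, 3.0, 4.0])
support = [0, 2]

# The support subspace is the span of the corresponding standard basis vectors.
B_support = np.eye(4)[:, support]
via_span = project_onto_span(theta, B_support)
via_support = project_onto_support(theta, support)

# A subspace that is not axis-aligned: the line spanned by (1, 1, 0, 0).
B_line = np.array([[1.0], [1.0], [0.0], [0.0]])
on_line = project_onto_span(theta, B_line)
```

In both cases the residual `theta - projection` is orthogonal to the subspace, which characterizes the minimizer in (9.20).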


9.2.1 Definition and some examples 


The notion of a decomposable regularizer is defined in terms of a pair of subspaces $\mathbb{M} \subseteq \overline{\mathbb{M}}$
of $\mathbb{R}^d$. The role of the model subspace $\mathbb{M}$ is to capture the constraints specified by the model;
for instance, as illustrated in the examples to follow, it might be the subspace of vectors with
a particular support or a subspace of low-rank matrices. The orthogonal complement of the
space $\overline{\mathbb{M}}$, namely the set

$$\overline{\mathbb{M}}^{\perp} := \{ v \in \mathbb{R}^d \mid \langle u, v \rangle = 0 \text{ for all } u \in \overline{\mathbb{M}} \}, \tag{9.21}$$

is referred to as the perturbation subspace, representing deviations away from the model
subspace $\mathbb{M}$. In the ideal case, we have $\overline{\mathbb{M}}^{\perp} = \mathbb{M}^{\perp}$, but the definition allows for the
possibility that $\overline{\mathbb{M}}$ is strictly larger than $\mathbb{M}$, so that $\overline{\mathbb{M}}^{\perp}$ is strictly smaller than $\mathbb{M}^{\perp}$. This
generality is needed for treating the case of low-rank matrices and nuclear norm, as discussed in
Chapter 10.


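Definition 9.9 Given a pair of subspaces $\mathbb{M} \subseteq \overline{\mathbb{M}}$, a norm-based regularizer $\Phi$ is
decomposable with respect to $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$ if

$$\Phi(\alpha + \beta) = \Phi(\alpha) + \Phi(\beta) \quad \text{for all } \alpha \in \mathbb{M} \text{ and } \beta \in \overline{\mathbb{M}}^{\perp}. \tag{9.22}$$


Figure 9.6 In the ideal case, decomposability is defined in terms of a subspace pair
$(\mathbb{M}, \mathbb{M}^{\perp})$. For any $\alpha \in \mathbb{M}$ and $\beta \in \mathbb{M}^{\perp}$, the regularizer should decompose as
$\Phi(\alpha + \beta) = \Phi(\alpha) + \Phi(\beta)$.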


See Figure 9.6 for the geometry of this definition. In order to build some intuition, let
us consider the ideal case $\overline{\mathbb{M}} = \mathbb{M}$, so that the decomposition (9.22) holds for all pairs
$(\alpha, \beta) \in \mathbb{M} \times \mathbb{M}^{\perp}$. For any given pair $(\alpha, \beta)$ of this form, the vector $\alpha + \beta$ can be interpreted
as a perturbation of the model vector $\alpha$ away from the subspace $\mathbb{M}$, and it is desirable that the
regularizer penalize such deviations as much as possible. By the triangle inequality for a
norm, we always have $\Phi(\alpha + \beta) \le \Phi(\alpha) + \Phi(\beta)$, so that the decomposability condition (9.22)
holds if and only if the triangle inequality is tight for all pairs $(\alpha, \beta) \in (\mathbb{M}, \mathbb{M}^{\perp})$. It is exactly
in this setting that the regularizer penalizes deviations away from the model subspace $\mathbb{M}$ as
much as possible.


Let us consider some illustrative examples: 


Example 9.10 (Decomposability and sparse vectors) We begin with the $\ell_1$-norm, which is
the canonical example of a decomposable regularizer. Let $S$ be a given subset of the index
set $\{1, \ldots, d\}$ and $S^c$ be its complement. We then define the model subspace

$$\mathbb{M} \equiv \mathbb{M}(S) := \{ \theta \in \mathbb{R}^d \mid \theta_j = 0 \text{ for all } j \in S^c \}, \tag{9.23}$$

corresponding to the set of all vectors that are supported on $S$. Observe that

$$\mathbb{M}^{\perp}(S) = \{ \theta \in \mathbb{R}^d \mid \theta_j = 0 \text{ for all } j \in S \}.$$

With these definitions, it is then easily seen that for any pair of vectors $\alpha \in \mathbb{M}(S)$ and
$\beta \in \mathbb{M}^{\perp}(S)$, we have

$$\|\alpha + \beta\|_1 = \|\alpha\|_1 + \|\beta\|_1,$$

showing that the $\ell_1$-norm is decomposable with respect to the pair $(\mathbb{M}(S), \mathbb{M}^{\perp}(S))$. ♣


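Example 9.11 (Decomposability and group sparse norms) We now turn to the notion of
decomposability for the group Lasso norm (9.9). In this case, the subspaces are defined in
terms of subsets of groups. More precisely, given any subset $S_{\mathcal{G}} \subseteq \mathcal{G}$ of the group index set,
consider the set

$$\mathbb{M}(S_{\mathcal{G}}) := \{ \theta \in \Omega \mid \theta_g = 0 \text{ for all } g \notin S_{\mathcal{G}} \}, \tag{9.24}$$

corresponding to the subspace of vectors supported only on groups indexed by $S_{\mathcal{G}}$. Note
that the orthogonal subspace is given by $\mathbb{M}^{\perp}(S_{\mathcal{G}}) = \{ \theta \in \Omega \mid \theta_g = 0 \text{ for all } g \in S_{\mathcal{G}} \}$. Letting
$\alpha \in \mathbb{M}(S_{\mathcal{G}})$ and $\beta \in \mathbb{M}^{\perp}(S_{\mathcal{G}})$ be arbitrary, we have

$$\Phi(\alpha + \beta) = \sum_{g \in S_{\mathcal{G}}} \|\alpha_g\|_2 + \sum_{g \in S_{\mathcal{G}}^c} \|\beta_g\|_2 = \Phi(\alpha) + \Phi(\beta),$$

thus showing that the group norm is decomposable with respect to the pair $(\mathbb{M}(S_{\mathcal{G}}), \mathbb{M}^{\perp}(S_{\mathcal{G}}))$.
♣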


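In the preceding example, we considered the case of non-overlapping groups. It is natural
to ask whether the same decomposability (that is, with respect to the pair $(\mathbb{M}(S_{\mathcal{G}}), \mathbb{M}^{\perp}(S_{\mathcal{G}}))$)
continues to hold for the ordinary group Lasso $\|\theta\|_{\mathcal{G},p} = \sum_{g \in \mathcal{G}} \|\theta_g\|_p$ when the groups are allowed
to be overlapping. A little thought shows that this is not the case in general: for instance, in
the case $\theta \in \mathbb{R}^4$, consider the overlapping groups $g_1 = \{1, 2\}$, $g_2 = \{2, 3\}$ and $g_3 = \{3, 4\}$. If
we let $S_{\mathcal{G}} = \{g_1\}$, then

$$\mathbb{M}^{\perp}(S_{\mathcal{G}}) = \{ \theta \in \mathbb{R}^4 \mid \theta_1 = \theta_2 = 0 \}.$$

The vector $\alpha = [0 \;\; 1 \;\; 0 \;\; 0]$ belongs to $\mathbb{M}(S_{\mathcal{G}})$, and the vector $\beta = [0 \;\; 0 \;\; 1 \;\; 0]$ belongs
to $\mathbb{M}^{\perp}(S_{\mathcal{G}})$. In the case of the group $\ell_1/\ell_2$-norm $\|\theta\|_{\mathcal{G},2} = \sum_{g \in \mathcal{G}} \|\theta_g\|_2$, we have
$\|\alpha + \beta\|_{\mathcal{G},2} = 1 + \sqrt{2} + 1$, but

$$\|\alpha\|_{\mathcal{G},2} + \|\beta\|_{\mathcal{G},2} = 1 + 1 + 1 + 1 = 4 > 2 + \sqrt{2}, \tag{9.25}$$

showing that decomposability is violated. However, this issue can be addressed by a different
choice of subspace pair, one that makes use of the additional freedom provided by allowing
for $\overline{\mathbb{M}} \supsetneq \mathbb{M}$. We illustrate this procedure in the following: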


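Example 9.12 (Decomposability of ordinary group Lasso with overlapping groups) As
before, let $S_{\mathcal{G}}$ be a subset of the group index set $\mathcal{G}$, and define the subspace $\mathbb{M}(S_{\mathcal{G}})$. We then
define the augmented group set

$$\overline{S}_{\mathcal{G}} := \Big\{ g \in \mathcal{G} \;\Big|\; g \cap \bigcup_{h \in S_{\mathcal{G}}} h \neq \emptyset \Big\}, \tag{9.26}$$

corresponding to the set of groups with non-empty intersection with some group in $S_{\mathcal{G}}$.
Note that in the case of non-overlapping groups, we have $\overline{S}_{\mathcal{G}} = S_{\mathcal{G}}$, whereas $\overline{S}_{\mathcal{G}} \supseteq S_{\mathcal{G}}$
in the more general case of overlapping groups. This augmented set defines the subspace
$\overline{\mathbb{M}} := \mathbb{M}(\overline{S}_{\mathcal{G}}) \supseteq \mathbb{M}(S_{\mathcal{G}})$, and we claim that the overlapping group norm is decomposable with
respect to the pair $(\mathbb{M}(S_{\mathcal{G}}), \mathbb{M}^{\perp}(\overline{S}_{\mathcal{G}}))$.

Indeed, let $\alpha$ and $\beta$ be arbitrary members of $\mathbb{M}(S_{\mathcal{G}})$ and $\mathbb{M}^{\perp}(\overline{S}_{\mathcal{G}})$, respectively. Note that
any element of $\mathbb{M}^{\perp}(\overline{S}_{\mathcal{G}})$ can have support only on the subset $\bigcup_{h \notin \overline{S}_{\mathcal{G}}} h$; at the same time, this
subset has no overlap with $\bigcup_{g \in S_{\mathcal{G}}} g$, and any element of $\mathbb{M}(S_{\mathcal{G}})$ is supported on this latter
subset. As a consequence of these properties, we have

$$\|\alpha + \beta\|_{\mathcal{G},p} = \sum_{g \in \mathcal{G}} \|(\alpha + \beta)_g\|_p = \sum_{g \in \overline{S}_{\mathcal{G}}} \|\alpha_g\|_p + \sum_{g \notin \overline{S}_{\mathcal{G}}} \|\beta_g\|_p = \|\alpha\|_{\mathcal{G},p} + \|\beta\|_{\mathcal{G},p},$$

as claimed. ♣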


It is worthwhile observing how our earlier counterexample (9.25) is excluded by the
construction given in Example 9.12. With the groups $g_1 = \{1, 2\}$, $g_2 = \{2, 3\}$ and $g_3 = \{3, 4\}$,
combined with the subset $S_{\mathcal{G}} = \{g_1\}$, we have $\overline{S}_{\mathcal{G}} = \{g_1, g_2\}$. The vector $\beta = [0 \;\; 0 \;\; 1 \;\; 0]$
belongs to the subspace

$$\mathbb{M}^{\perp}(S_{\mathcal{G}}) = \{ \theta \in \mathbb{R}^4 \mid \theta_1 = \theta_2 = 0 \},$$

but it does not belong to the smaller subspace

$$\mathbb{M}^{\perp}(\overline{S}_{\mathcal{G}}) = \{ \theta \in \mathbb{R}^4 \mid \theta_1 = \theta_2 = \theta_3 = 0 \}.$$

Consequently, it does not violate the decomposability property. However, note that there is
a statistical price to be paid by enlarging to the augmented set $\mathbb{M}(\overline{S}_{\mathcal{G}})$: as our later results
demonstrate, the statistical estimation error scales as a function of the size of this set.
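
The counterexample and its repair are easy to check numerically. The following is a minimal sketch in Python with NumPy; the helper name `group_norm` and the vector `beta_aug` are our own illustrative choices, and the groups are written with 0-based indices.

```python
import numpy as np

def group_norm(theta, groups, p=2):
    """Sum over groups of the l_p norm of theta restricted to each group (groups may overlap)."""
    return sum(np.linalg.norm(theta[list(g)], ord=p) for g in groups)

# Groups g1 = {1,2}, g2 = {2,3}, g3 = {3,4} from the text, shifted to 0-based indices.
groups = [(0, 1), (1, 2), (2, 3)]

alpha = np.array([0.0, 1.0, 0.0, 0.0])     # supported on g1
beta = np.array([0.0, 0.0, 1.0, 0.0])      # theta_1 = theta_2 = 0 only
beta_aug = np.array([0.0, 0.0, 0.0, 1.0])  # theta_1 = theta_2 = theta_3 = 0 (augmented complement)

# Original pair: the triangle inequality is strict, so decomposability fails.
lhs = group_norm(alpha + beta, groups)
rhs = group_norm(alpha, groups) + group_norm(beta, groups)

# Augmented pair: the two sides agree.
lhs_aug = group_norm(alpha + beta_aug, groups)
rhs_aug = group_norm(alpha, groups) + group_norm(beta_aug, groups)

print(lhs, rhs)          # 2 + sqrt(2) versus 4
print(lhs_aug, rhs_aug)  # both equal 3
```

The second comparison uses a vector from the smaller perturbation subspace of Example 9.12, which is exactly the restriction that removes the overlap through group $g_2$.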


As discussed previously, many problems involve estimating low-rank matrices, in which
context the nuclear norm (9.19) plays an important role. In Chapter 10, we show how the
nuclear norm is decomposable with respect to appropriately chosen subspaces. Unlike our
previous examples (in which $\overline{\mathbb{M}} = \mathbb{M}$), in this case we need to use the full flexibility of our
definition, and choose $\overline{\mathbb{M}}$ to be a strict superset of $\mathbb{M}$.

Finally, it is worth noting that sums of decomposable regularizers over disjoint sets of
parameters remain decomposable: that is, if $\Phi_1$ and $\Phi_2$ are decomposable with respect to
subspaces over $\Omega_1$ and $\Omega_2$ respectively, then the sum $\Phi_1 + \Phi_2$ remains decomposable with
respect to the same subspaces extended to the Cartesian product space $\Omega_1 \times \Omega_2$. For instance,
this property is useful for the matrix decomposition problems discussed in Chapter 10, which
involve a pair of matrices $\Lambda$ and $\Gamma$, and the associated regularizers $\Phi_1(\Lambda) = |||\Lambda|||_{\mathrm{nuc}}$ and
$\Phi_2(\Gamma) = \|\Gamma\|_1$.


9.2.2 A key consequence of decomposability 


Why is decomposability important in the context of M-estimation? Ultimately, our goal is
to provide bounds on the error vector $\widehat{\Delta} := \widehat{\theta} - \theta^*$ between any global optimum of the
optimization problem (9.3) and the unknown parameter $\theta^*$. In this section, we show that
decomposability, in conjunction with a suitable choice for the regularization weight $\lambda_n$,
ensures that the error $\widehat{\Delta}$ must lie in a very restricted set.

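In order to specify a “suitable” choice of regularization parameter $\lambda_n$, we need to define
the notion of the dual norm associated with our regularizer. Given any norm $\Phi : \mathbb{R}^d \to \mathbb{R}$, its
dual norm is defined in a variational manner as

$$\Phi^*(v) := \sup_{\Phi(u) \le 1} \langle u, v \rangle. \tag{9.27}$$


| Regularizer $\Phi$ | Dual norm $\Phi^*$ |
|---|---|
| $\ell_1$-norm: $\Phi(u) = \sum_{j=1}^d |u_j|$ | $\ell_\infty$-norm: $\Phi^*(v) = \|v\|_\infty = \max_{j=1,\ldots,d} |v_j|$ |
| Group $\ell_1/\ell_p$-norm: $\Phi(u) = \sum_{g \in \mathcal{G}} \|u_g\|_p$ (non-overlapping groups) | Group $\ell_\infty/\ell_q$-norm: $\Phi^*(v) = \max_{g \in \mathcal{G}} \|v_g\|_q$, where $\frac{1}{p} + \frac{1}{q} = 1$ |
| Nuclear norm: $\Phi(M) = \sum_{j=1}^{d} \sigma_j(M)$, where $d = \min\{d_1, d_2\}$ | $\ell_2$-operator norm: $\Phi^*(N) = \max_{j=1,\ldots,d} \sigma_j(N)$ |
| Overlap group norm: $\Phi(u) = \inf_{u = \sum_{g \in \mathcal{G}} w_g} \sum_{g \in \mathcal{G}} \|w_g\|_p$ | Overlap dual norm: $\Phi^*(v) = \max_{g \in \mathcal{G}} \|v_g\|_q$ |
| Sparse-low-rank decomposition norm: $\Phi_\omega(M) = \inf_{M = A + B} \{ \|A\|_1 + \omega |||B|||_{\mathrm{nuc}} \}$ | Weighted max. norm: $\Phi^*(N) = \max\{ \|N\|_{\max}, \omega^{-1} |||N|||_2 \}$ |

Table 9.1 Primal and dual pairs of regularizers in various cases. See Exercises 9.4 and 9.5 for
verification of some of these correspondences.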


Table 9.1 gives some examples of various dual norm pairs. 
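
As a quick sanity check of the first row of the table, the variational definition of the dual norm can be evaluated directly for the $\ell_1$-norm. The sketch below is illustrative only (the dimension, seed and variable names are our own choices); it uses the fact that the $\ell_1$ unit ball is the convex hull of the $2d$ signed standard basis vectors, so the supremum of a linear function over the ball is attained at one of them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
v = rng.normal(size=d)

# Vertices of the l1 unit ball: +e_j and -e_j for j = 1, ..., d.
vertices = np.vstack([np.eye(d), -np.eye(d)])
dual_via_vertices = np.max(vertices @ v)   # sup of <u, v> over the ball

# Random points on the boundary ||u||_1 = 1 can never exceed this value.
u = rng.normal(size=(10000, d))
u /= np.abs(u).sum(axis=1, keepdims=True)
best_random = np.max(u @ v)

linf = np.max(np.abs(v))
print(dual_via_vertices, linf, best_random)
```

The first two printed values coincide, in agreement with $\Phi^*(v) = \|v\|_\infty$, while the random search only approaches this value from below.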

Our choice of regularization parameter is specified in terms of the random vector $\nabla \mathcal{L}_n(\theta^*)$,
the gradient of the empirical cost evaluated at $\theta^*$, also referred to as the score function.
Under mild regularity conditions, we have $\mathbb{E}[\nabla \mathcal{L}_n(\theta^*)] = \nabla \overline{\mathcal{L}}(\theta^*)$. Consequently, when the
target parameter $\theta^*$ lies in the interior of the parameter space $\Omega$, by the optimality
conditions for the minimization (9.2), the random vector $\nabla \mathcal{L}_n(\theta^*)$ has zero mean. Under ideal
circumstances, we expect that the score function will not be too large, and we measure its
fluctuations in terms of the dual norm, thereby defining the “good event”

$$\mathbb{G}(\lambda_n) := \Big\{ \Phi^*(\nabla \mathcal{L}_n(\theta^*)) \le \frac{\lambda_n}{2} \Big\}. \tag{9.28}$$


With this set-up, we are now ready for the statement of the main technical result of this sec- 
tion. The reader should recall the definition of the subspace projection operator (9.20). 


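Proposition 9.13 Let $\mathcal{L}_n : \Omega \to \mathbb{R}$ be a convex function, let the regularizer $\Phi : \Omega \to
[0, \infty)$ be a norm, and consider a subspace pair $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$ over which $\Phi$ is decomposable.
Then conditioned on the event $\mathbb{G}(\lambda_n)$, the error $\widehat{\Delta} = \widehat{\theta} - \theta^*$ belongs to the set

$$\mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^{\perp}) := \{ \Delta \in \Omega \mid \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) \le 3\Phi(\Delta_{\overline{\mathbb{M}}}) + 4\Phi(\theta^*_{\mathbb{M}^{\perp}}) \}. \tag{9.29}$$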


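When the subspaces $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$ and parameter $\theta^*$ are clear from the context, we adopt the
shorthand notation $\mathbb{C}$. Figure 9.7 provides an illustration of the geometric structure of the
set $\mathbb{C}$. To understand its significance, let us consider the special case when $\theta^* \in \mathbb{M}$, so that
$\theta^*_{\mathbb{M}^{\perp}} = 0$. In this case, membership of $\widehat{\Delta}$ in $\mathbb{C}$ implies that $\Phi(\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}) \le 3\Phi(\widehat{\Delta}_{\overline{\mathbb{M}}})$, and hence that

$$\Phi(\widehat{\Delta}) = \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}} + \widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}) \le \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) + \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}) \le 4\Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}). \tag{9.30}$$

Consequently, when measured in the norm defined by the regularizer, the vector $\widehat{\Delta}$ is only a
constant factor larger than the projected quantity $\widehat{\Delta}_{\overline{\mathbb{M}}}$. Whenever the subspace $\overline{\mathbb{M}}$ is relatively
small, this inequality provides significant control on $\widehat{\Delta}$.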
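
To see the cone constraint at work, one can simulate a sparse linear regression and check it for the Lasso, taking $\Phi = \|\cdot\|_1$ and $\mathbb{M} = \overline{\mathbb{M}} = \mathbb{M}(S)$ with $S$ the support of $\theta^*$. The following is a minimal sketch under stated assumptions: the Gaussian design, the problem sizes, and the proximal-gradient solver `lasso_ista` are our own illustrative choices, and $\lambda_n$ is set from the noise vector, which is available only in simulation.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Approximately minimize (1/2n)||y - X theta||_2^2 + lam * ||theta||_1 by proximal gradient."""
    n, d = X.shape
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]  # 1 / Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(0)
n, d, s, sigma = 100, 200, 5, 0.5
X = rng.normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[:s] = 1.0                      # theta* is supported on S = {first s coordinates}
w = sigma * rng.normal(size=n)
y = X @ theta_star + w

# Score at theta*: grad L_n(theta*) = -X^T w / n, measured in the dual (l_inf) norm.
score_dual_norm = np.max(np.abs(X.T @ w / n))
lam = 4.0 * score_dual_norm               # guarantees the good event lam >= 2 * Phi^*(grad)

theta_hat = lasso_ista(X, y, lam)
delta = theta_hat - theta_star
on_support = np.abs(delta[:s]).sum()      # Phi(Delta_M)
off_support = np.abs(delta[s:]).sum()     # Phi(Delta_{M-perp})
print(off_support, 3.0 * on_support)
```

Since $\theta^* \in \mathbb{M}(S)$ here, the set (9.29) reduces to $\|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1$, and the printed off-support mass should not exceed three times the on-support mass.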


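[Figure 9.7: two panels (a) and (b), with axes $\Phi(\Delta_{\mathbb{M}})$, $(\Delta_1, \Delta_2)$ and $\Phi(\Delta_{\mathbb{M}^{\perp}})$]


Figure 9.7 Illustration of the set $\mathbb{C}_{\theta^*}(\mathbb{M}, \mathbb{M}^{\perp})$ in the special case $\Delta = (\Delta_1, \Delta_2, \Delta_3) \in \mathbb{R}^3$
and regularizer $\Phi(\Delta) = \|\Delta\|_1$, relevant for sparse vectors (Example 9.1). This picture
shows the case $S = \{3\}$, so that the model subspace is $\mathbb{M}(S) = \{ \Delta \in \mathbb{R}^3 \mid \Delta_1 = \Delta_2 = 0 \}$,
and its orthogonal complement is given by $\mathbb{M}^{\perp}(S) = \{ \Delta \in \mathbb{R}^3 \mid \Delta_3 = 0 \}$. (a) In the
special case when $\theta^*_1 = \theta^*_2 = 0$, so that $\theta^* \in \mathbb{M}$, the set $\mathbb{C}(\mathbb{M}, \mathbb{M}^{\perp})$ is a cone, with no
dependence on $\theta^*$. (b) When $\theta^*$ does not belong to $\mathbb{M}$, the set $\mathbb{C}(\mathbb{M}, \mathbb{M}^{\perp})$ is enlarged in
the coordinates $(\Delta_1, \Delta_2)$ that span $\mathbb{M}^{\perp}$. It is no longer a cone, but is still a star-shaped
set.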


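We now turn to the proof of the proposition:

Proof Our argument is based on the function $F : \Omega \to \mathbb{R}$ given by

$$F(\Delta) := \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) + \lambda_n \{ \Phi(\theta^* + \Delta) - \Phi(\theta^*) \}. \tag{9.31}$$

By construction, we have $F(0) = 0$, and so the optimality of $\widehat{\theta}$ implies that the error vector
$\widehat{\Delta} = \widehat{\theta} - \theta^*$ must satisfy the condition $F(\widehat{\Delta}) \le 0$, corresponding to a basic inequality in this
general setting. Our goal is to exploit this fact in order to establish the inclusion (9.29). In
order to do so, we require control on the two separate pieces of $F$, as summarized in the
following:


Lemma 9.14 (Deviation inequalities) For any decomposable regularizer and parameters
$\theta^*$ and $\Delta$, we have

$$\Phi(\theta^* + \Delta) - \Phi(\theta^*) \ge \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}) - 2\Phi(\theta^*_{\mathbb{M}^{\perp}}). \tag{9.32}$$

Moreover, for any convex function $\mathcal{L}_n$, conditioned on the event $\mathbb{G}(\lambda_n)$, we have

$$\mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) \ge -\frac{\lambda_n}{2} \big[ \Phi(\Delta_{\overline{\mathbb{M}}}) + \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) \big]. \tag{9.33}$$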

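Given this lemma, the claim of Proposition 9.13 follows immediately. Indeed, combining
the two lower bounds (9.32) and (9.33), we obtain

$$0 \ge F(\widehat{\Delta}) \ge \lambda_n \big\{ \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) - 2\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\} - \frac{\lambda_n}{2} \big\{ \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) + \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}) \big\}$$
$$= \frac{\lambda_n}{2} \big\{ \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}) - 3\Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) - 4\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\},$$

from which the claim follows.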
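Thus, it remains to prove Lemma 9.14, and here we exploit decomposability of the
regularizer. Since $\Phi(\theta^* + \Delta) = \Phi(\theta^*_{\mathbb{M}} + \theta^*_{\mathbb{M}^{\perp}} + \Delta_{\overline{\mathbb{M}}} + \Delta_{\overline{\mathbb{M}}^{\perp}})$, applying the triangle inequality yields

$$\Phi(\theta^* + \Delta) \ge \Phi(\theta^*_{\mathbb{M}} + \Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\theta^*_{\mathbb{M}^{\perp}} + \Delta_{\overline{\mathbb{M}}}) \ge \Phi(\theta^*_{\mathbb{M}} + \Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\theta^*_{\mathbb{M}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}).$$

By decomposability applied to $\theta^*_{\mathbb{M}}$ and $\Delta_{\overline{\mathbb{M}}^{\perp}}$, we have $\Phi(\theta^*_{\mathbb{M}} + \Delta_{\overline{\mathbb{M}}^{\perp}}) = \Phi(\theta^*_{\mathbb{M}}) + \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}})$, so
that

$$\Phi(\theta^* + \Delta) \ge \Phi(\theta^*_{\mathbb{M}}) + \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\theta^*_{\mathbb{M}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}). \tag{9.34}$$

Similarly, by the triangle inequality, we have $\Phi(\theta^*) \le \Phi(\theta^*_{\mathbb{M}}) + \Phi(\theta^*_{\mathbb{M}^{\perp}})$. Combining this
inequality with the bound (9.34), we obtain

$$\Phi(\theta^* + \Delta) - \Phi(\theta^*) \ge \Phi(\theta^*_{\mathbb{M}}) + \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\theta^*_{\mathbb{M}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}) - \big\{ \Phi(\theta^*_{\mathbb{M}}) + \Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\}$$
$$= \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}) - 2\Phi(\theta^*_{\mathbb{M}^{\perp}}),$$

which yields the claim (9.32).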
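Turning to the cost difference, using the convexity of the cost function $\mathcal{L}_n$, we have

$$\mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) \ge \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle \ge -|\langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle|.$$

Applying the Hölder inequality with the regularizer and its dual (see Exercise 9.7), we have

$$|\langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle| \le \Phi^*(\nabla \mathcal{L}_n(\theta^*)) \, \Phi(\Delta) \le \frac{\lambda_n}{2} \big[ \Phi(\Delta_{\overline{\mathbb{M}}}) + \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) \big],$$

where the final step uses the triangle inequality, and the assumed bound $\lambda_n \ge 2\Phi^*(\nabla \mathcal{L}_n(\theta^*))$.

Putting together the pieces yields the claimed bound (9.33). This completes the proof of
Lemma 9.14, and hence the proof of the proposition.


9.3 Restricted curvature conditions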


We now turn to the second component of a general framework, which concerns the curvature 
of the cost function. Before discussing the general high-dimensional setting, let us recall the 
classical role of curvature in maximum likelihood estimation, where it enters via the Fisher 
information matrix. Under i.i.d. sampling, the principle of maximum likelihood is equivalent 
to minimizing the cost function 


$$\mathcal{L}_n(\theta) := -\frac{1}{n} \sum_{i=1}^{n} \log \mathbb{P}_\theta(z_i). \tag{9.35}$$


The Hessian of this cost function $\nabla^2 \mathcal{L}_n(\theta)$ is the sample version of the Fisher information
matrix; as the sample size $n$ increases to infinity with $d$ fixed, it converges in a pointwise
sense to the population Fisher information $\nabla^2 \overline{\mathcal{L}}(\theta)$. Recall that the population cost function
$\overline{\mathcal{L}}$ was defined previously in equation (9.1). The Fisher information matrix evaluated at $\theta^*$
provides a lower bound on the accuracy of any statistical estimator via the Cramér-Rao
bound. As a second derivative, the Fisher information matrix $\nabla^2 \overline{\mathcal{L}}(\theta^*)$ captures the curvature
of the cost function around the point $\theta^*$.


Figure 9.8 Illustration of the cost function $\theta \mapsto \mathcal{L}_n(\theta; Z_1^n)$. In the high-dimensional
setting ($d > n$), although it may be curved in certain directions (e.g., $\Delta_{\mathrm{good}}$), there are
$d - n$ directions in which it is flat up to second order (e.g., $\Delta_{\mathrm{bad}}$).


In the high-dimensional setting, the story becomes a little more complicated. In particular,
whenever $n < d$, then the sample Fisher information matrix $\nabla^2 \mathcal{L}_n(\theta^*)$ is rank-degenerate.
Geometrically, this rank degeneracy implies that the cost function takes the form shown in
Figure 9.8: while curved upwards in certain directions, there are $d - n$ directions in which
it is flat up to second order. Consequently, the high-dimensional setting precludes any type
of uniform lower bound on the curvature, and we can only hope to obtain some form of
restricted curvature. There are several ways in which to develop such notions, and we describe
two in the sections to follow, the first based on lower bounding the error in the first-order
Taylor-series expansion, and the second by directly lower bounding the curvature of the
gradient mapping.
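
The rank degeneracy is easy to observe numerically. The sketch below specializes to the least-squares cost, whose Hessian is $\mathbf{X}^{\mathrm{T}}\mathbf{X}/n$ regardless of $\theta$; the Gaussian design, the sizes $n = 20$, $d = 50$ and the names `delta_good`, `delta_bad` are illustrative assumptions on our part.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                      # high-dimensional regime: n < d
X = rng.normal(size=(n, d))
hessian = X.T @ X / n              # Hessian of (1/2n)||y - X theta||_2^2

rank = np.linalg.matrix_rank(hessian)
eigvals = np.linalg.eigvalsh(hessian)          # eigenvalues in ascending order
num_flat = int(np.sum(eigvals < 1e-10))        # numerically zero eigenvalues

# Right singular vectors of X: the first has the most curvature,
# the last lies in the null space of X and hence is a flat direction.
_, _, vt = np.linalg.svd(X)                    # vt has shape (d, d)
delta_good, delta_bad = vt[0], vt[-1]
curv_good = delta_good @ hessian @ delta_good
curv_bad = delta_bad @ hessian @ delta_bad

print(rank, num_flat)        # rank n and d - n flat directions
print(curv_good, curv_bad)   # strictly positive versus (numerically) zero
```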


9.3.1 Restricted strong convexity 


We begin by describing the notion of restricted strong convexity, which is defined by the 
Taylor-series expansion. Given any differentiable cost function, we can use the gradient to 
form the first-order Taylor approximation, which then defines the first-order Taylor-series 
error 


$$\mathcal{E}_n(\Delta) := \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle. \tag{9.36}$$


Whenever the function $\theta \mapsto \mathcal{L}_n(\theta)$ is convex, this error term is always guaranteed to be
non-negative.² Strong convexity requires that this lower bound holds with a quadratic slack: in
particular, for a given norm $\|\cdot\|$, the cost function is locally $\kappa$-strongly convex at $\theta^*$ if the
first-order Taylor error is lower bounded as

$$\mathcal{E}_n(\Delta) \ge \frac{\kappa}{2} \|\Delta\|^2 \tag{9.37}$$

for all $\Delta$ in a neighborhood of the origin. As previously discussed, this notion of strong
convexity cannot hold for a generic high-dimensional problem. But for decomposable
regularizers, we have seen (Proposition 9.13) that the error vector must belong to a very special
set, and we use this fact to define the notion of restricted strong convexity.


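Definition 9.15 For a given norm $\|\cdot\|$ and regularizer $\Phi(\cdot)$, the cost function satisfies
a restricted strong convexity (RSC) condition with radius $R > 0$, curvature $\kappa > 0$ and
tolerance $\tau_n^2$ if

$$\mathcal{E}_n(\Delta) \ge \frac{\kappa}{2} \|\Delta\|^2 - \tau_n^2 \Phi^2(\Delta) \quad \text{for all } \Delta \in \mathbb{B}(R). \tag{9.38}$$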


To clarify a few aspects of this definition, the set $\mathbb{B}(R)$ is the ball of radius $R$ defined by the given
norm $\|\cdot\|$. In our applications of RSC, the norm $\|\cdot\|$ will be derived from an inner product
on the space $\Omega$. Standard cases include the usual Euclidean norm on $\mathbb{R}^d$, and the Frobenius
norm on the matrix space $\mathbb{R}^{d_1 \times d_2}$. Various types of weighted quadratic norms also fall within
this general class.

Note that, if we set the tolerance term $\tau_n^2 = 0$, then the RSC condition (9.38) is equivalent
to asserting that $\mathcal{L}_n$ is locally strongly convex in a neighborhood of $\theta^*$ with coefficient $\kappa$. As
previously discussed, such a strong convexity condition cannot hold in the high-dimensional
setting. However, given our goal of proving error bounds on M-estimators, we are not
interested in all directions, but rather only the directions in which the error vector $\widehat{\Delta} = \widehat{\theta} - \theta^*$ can
lie. For decomposable regularizers, Proposition 9.13 guarantees that the error vector must
lie in the very special “cone-like” sets $\mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$. Even with a strictly positive tolerance
$\tau_n^2 > 0$, an RSC condition of the form (9.38) can be used to guarantee a lower curvature over
this restricted set, as long as the sample size is sufficiently large. We formalize this intuition
after considering a few concrete instances of Definition 9.15.

2 Indeed, for differentiable functions, this property may be viewed as an equivalent definition of convexity.


Example 9.16 (Restricted eigenvalues for least-squares cost) In this example, we show
how the restricted eigenvalue conditions (see Definition 7.12 in Chapter 7) correspond
to a special case of restricted strong convexity. For the least-squares objective
$\mathcal{L}_n(\theta) = \frac{1}{2n} \|y - \mathbf{X}\theta\|_2^2$, an easy calculation yields that the first-order Taylor error is given by
$\mathcal{E}_n(\Delta) = \frac{\|\mathbf{X}\Delta\|_2^2}{2n}$. A restricted strong convexity condition with the $\ell_1$-norm then takes the form

$$\frac{\|\mathbf{X}\Delta\|_2^2}{2n} \ge \frac{\kappa}{2} \|\Delta\|_2^2 - \tau_n^2 \|\Delta\|_1^2 \quad \text{for all } \Delta \in \mathbb{R}^d. \tag{9.39}$$

For various types of sub-Gaussian matrices, bounds of this form hold with high probability
for the choice $\tau_n^2 \asymp \frac{\log d}{n}$. Theorem 7.16 in Chapter 7 provides one instance of such a result.
As a side remark, this example shows that the least-squares objective is special in two
ways: the first-order Taylor error is independent of $\theta^*$ and, moreover, it is a positively
homogeneous function of degree two, that is, $\mathcal{E}_n(t\Delta) = t^2 \mathcal{E}_n(\Delta)$ for all $t \in \mathbb{R}$. The former property
implies that we need not be concerned about uniformity in $\theta^*$, whereas the latter implies that
it is not necessary to localize $\Delta$ to a ball $\mathbb{B}(R)$. ♣
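
The two special properties of the least-squares Taylor error noted in this example can be confirmed with a few lines of code. This is a small sketch with randomly generated data; the function names `loss`, `grad` and `taylor_error` are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def loss(theta):
    """Least-squares cost (1/2n)||y - X theta||_2^2."""
    return np.sum((y - X @ theta) ** 2) / (2 * n)

def grad(theta):
    """Gradient of the least-squares cost."""
    return X.T @ (X @ theta - y) / n

def taylor_error(theta, delta):
    """First-order Taylor error of the cost around theta, in the direction delta."""
    return loss(theta + delta) - loss(theta) - grad(theta) @ delta

theta_a, theta_b = rng.normal(size=d), rng.normal(size=d)
delta = rng.normal(size=d)
closed_form = np.sum((X @ delta) ** 2) / (2 * n)   # ||X delta||_2^2 / (2n)

print(taylor_error(theta_a, delta), taylor_error(theta_b, delta), closed_form)
print(taylor_error(theta_a, 3.0 * delta), 9.0 * closed_form)
```

The first line prints three equal values (no dependence on the base point), and the second illustrates homogeneity of degree two with $t = 3$.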


Later in Section 9.8, we provide more general results, showing that a broader class of cost 
functions satisfy a restricted strong convexity condition of the type (9.39). Let us consider 
one example here: 


Example 9.17 (RSC for generalized linear models) Recall the family of generalized linear
models from Example 9.2, and the cost function (9.7) defined by the negative log-likelihood.
Suppose that we draw $n$ i.i.d. samples, in which the covariates $\{x_i\}_{i=1}^n$ are drawn from a
zero-mean sub-Gaussian distribution with non-degenerate covariance matrix $\Sigma$. As a consequence
of a result to follow (Theorem 9.36), the Taylor-series error of various GLM log-likelihoods
satisfies a lower bound of the form

$$\mathcal{E}_n(\Delta) \ge \frac{\kappa}{2} \|\Delta\|_2^2 - c_1 \frac{\log d}{n} \|\Delta\|_1^2 \quad \text{for all } \|\Delta\|_2 \le 1 \tag{9.40}$$

with probability greater than $1 - c_2 \exp(-c_3 n)$.
Theorem 9.36 actually provides a more general guarantee in terms of the quantity

$$\mu_n(\Phi^*) := \mathbb{E}_{x,\varepsilon} \Big[ \Phi^*\Big( \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i x_i \Big) \Big], \tag{9.41}$$

where $\Phi^*$ denotes the dual norm, and $\{\varepsilon_i\}_{i=1}^n$ is a sequence of i.i.d. Rademacher variables.
With this notation, we have

$$\mathcal{E}_n(\Delta) \ge \frac{\kappa}{2} \|\Delta\|_2^2 - c_1 \mu_n^2(\Phi^*) \, \Phi^2(\Delta) \quad \text{for all } \|\Delta\|_2 \le 1 \tag{9.42}$$

with probability greater than $1 - c_2 \exp(-c_3 n)$. This result is a generalization of our previous
bound (9.40), since $\mu_n(\Phi^*) \lesssim \sqrt{\frac{\log d}{n}}$ in the case of $\ell_1$-regularization.
In Exercise 9.8, we bound the quantity (9.41) for various norms. For group Lasso with
group set $\mathcal{G}$ and maximum group size $m$, we show that

$$\mu_n(\Phi^*) \lesssim \sqrt{\frac{m}{n}} + \sqrt{\frac{\log |\mathcal{G}|}{n}}, \tag{9.43a}$$

whereas for the nuclear norm for $d_1 \times d_2$ matrices, we show that

$$\mu_n(\Phi^*) \lesssim \sqrt{\frac{d_1}{n}} + \sqrt{\frac{d_2}{n}}. \tag{9.43b}$$

We also show how these results, in conjunction with the lower bound (9.42), imply suitable
forms of restricted convexity as long as the sample size is sufficiently large. ♣
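
For $\ell_1$-regularization, the quantity (9.41) can be estimated by Monte Carlo and compared with the $\sqrt{(\log d)/n}$ scaling. The sketch below assumes standard Gaussian covariates, which is one particular sub-Gaussian choice of ours; in that case $\frac{1}{n}\sum_i \varepsilon_i x_i \sim N(0, I_d/n)$, and the standard Gaussian maximal inequality gives $\mu_n(\Phi^*) \le \sqrt{2\log(2d)/n}$. The function name `mu_n_linf` and the sizes are illustrative.

```python
import numpy as np

def mu_n_linf(n, d, trials=200, seed=0):
    """Monte Carlo estimate of E || (1/n) sum_i eps_i x_i ||_inf for x_i ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.normal(size=(n, d))               # covariates (assumed standard Gaussian)
        eps = rng.choice([-1.0, 1.0], size=n)     # i.i.d. Rademacher signs
        vals[t] = np.max(np.abs(eps @ x / n))     # dual (l_inf) norm of the average
    return vals.mean()

n = 100
estimates = {d: mu_n_linf(n, d) for d in (10, 100, 1000)}
for d, est in estimates.items():
    print(d, est, np.sqrt(2 * np.log(2 * d) / n))   # estimate versus the Gaussian upper bound
```

The estimates grow only logarithmically in $d$, which is why the tolerance term in (9.40) remains small even when $d$ is much larger than $n$.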


We conclude this section with the definition of one last geometric parameter that plays
an important role. As we have just seen, in the context of $\ell_1$-regularization and the RE
condition, the cone constraint is very useful; in particular, it implies that $\|\Delta\|_1 \le 4\sqrt{s}\,\|\Delta\|_2$, a
bound used repeatedly in Chapter 7. Returning to the general setting, we need to study how
to translate between $\Phi(\Delta_{\mathbb{M}})$ and $\|\Delta_{\mathbb{M}}\|$ for an arbitrary decomposable regularizer and error
norm.


Definition 9.18 (Subspace Lipschitz constant) For any subspace $S$ of $\mathbb{R}^d$, the subspace
Lipschitz constant with respect to the pair $(\Phi, \|\cdot\|)$ is given by

$$\Psi(S) := \sup_{u \in S \setminus \{0\}} \frac{\Phi(u)}{\|u\|}. \tag{9.44}$$


To clarify our terminology, this quantity is the Lipschitz constant of the regularizer with
respect to the error norm, but as restricted to the subspace $S$. It corresponds to the worst-case
price of translating between the $\Phi$- and $\|\cdot\|$-norms for any vector in $S$.


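To illustrate its use, let us consider it in the special case when $\theta^* \in \mathbb{M}$. Then for any
$\Delta \in \mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$, we have

$$\Phi(\Delta) \overset{(i)}{\le} \Phi(\Delta_{\overline{\mathbb{M}}}) + \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) \overset{(ii)}{\le} 4\Phi(\Delta_{\overline{\mathbb{M}}}) \overset{(iii)}{\le} 4\Psi(\overline{\mathbb{M}}) \|\Delta\|, \tag{9.45}$$

where step (i) follows from the triangle inequality, step (ii) from membership in $\mathbb{C}(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$,
and step (iii) from the definition of $\Psi(\overline{\mathbb{M}})$.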

As a simple example, if $\mathbb{M}$ is a subspace of $s$-sparse vectors, then with regularizer $\Phi(u) =
\|u\|_1$ and error norm $\|u\| = \|u\|_2$, we have $\Psi(\mathbb{M}) = \sqrt{s}$. In this way, we see that
inequality (9.45) is a generalization of the familiar inequality $\|\Delta\|_1 \le 4\sqrt{s}\,\|\Delta\|_2$ in the context of
sparse vectors. The subspace Lipschitz constant appears explicitly in the main results, and
also arises in establishing restricted strong convexity.
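
The value $\Psi(\mathbb{M}) = \sqrt{s}$ follows from the Cauchy-Schwarz inequality, with equality for vectors whose non-zero entries all have the same magnitude. A brief numerical illustration, with sizes and variable names chosen by us:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 50, 9
support = np.arange(s)                      # subspace of vectors supported on the first s coordinates

# Random vectors in the subspace: the ratio ||u||_1 / ||u||_2 never exceeds sqrt(s).
u = np.zeros((5000, d))
u[:, support] = rng.normal(size=(5000, s))
ratios = np.abs(u).sum(axis=1) / np.linalg.norm(u, axis=1)
best_random = ratios.max()

# A sign vector on the support attains the supremum exactly.
signs = np.zeros(d)
signs[support] = rng.choice([-1.0, 1.0], size=s)
attained = np.linalg.norm(signs, 1) / np.linalg.norm(signs, 2)

print(best_random, attained, np.sqrt(s))
```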


9.4 Some general theorems 


Thus far, we have discussed the notion of decomposable regularizers, and some related
notions of restricted curvature for the cost function. In this section, we state and prove some
results on the estimation error, namely, the quantity $\widehat{\theta} - \theta^*$, where $\widehat{\theta}$ denotes any optimum of
the regularized M-estimator (9.3).


9.4.1 Guarantees under restricted strong convexity 


We begin by stating and proving a general result that holds under the restricted strong con- 
vexity condition given in Section 9.3.1. Let us summarize the assumptions that we impose 
throughout this section: 


(A1) The cost function is convex, and satisfies the local RSC condition (9.38) with curvature
$\kappa$, radius $R$ and tolerance $\tau_n^2$ with respect to an inner-product induced norm $\|\cdot\|$.
(A2) There is a pair of subspaces $\mathbb{M} \subseteq \overline{\mathbb{M}}$ such that the regularizer decomposes over $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$.


We state the result as a deterministic claim, but conditioned on the “good” event 


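$$\mathbb{G}(\lambda_n) := \Big\{ \Phi^*(\nabla \mathcal{L}_n(\theta^*)) \le \frac{\lambda_n}{2} \Big\}. \tag{9.46}$$

Our bound involves the quantity

$$\varepsilon_n^2(\mathbb{M}, \overline{\mathbb{M}}^{\perp}) := \underbrace{9 \frac{\lambda_n^2}{\kappa^2} \Psi^2(\overline{\mathbb{M}})}_{\text{estimation error}} + \underbrace{\frac{8}{\kappa} \Big\{ \lambda_n \Phi(\theta^*_{\mathbb{M}^{\perp}}) + 16 \tau_n^2 \Phi^2(\theta^*_{\mathbb{M}^{\perp}}) \Big\}}_{\text{approximation error}}, \tag{9.47}$$

which depends on the choice of our subspace pair $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$.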


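Theorem 9.19 (Bounds for general models) Under conditions (A1) and (A2), consider
the regularized M-estimator (9.3) conditioned on the event $\mathbb{G}(\lambda_n)$.

(a) Any optimal solution satisfies the bound

$$\Phi(\widehat{\theta} - \theta^*) \le 4 \big\{ \Psi(\overline{\mathbb{M}}) \|\widehat{\theta} - \theta^*\| + \Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\}. \tag{9.48a}$$

(b) For any subspace pair $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$ such that $\tau_n^2 \Psi^2(\overline{\mathbb{M}}) \le \frac{\kappa}{64}$ and $\varepsilon_n(\mathbb{M}, \overline{\mathbb{M}}^{\perp}) \le R$, we have

$$\|\widehat{\theta} - \theta^*\|^2 \le \varepsilon_n^2(\mathbb{M}, \overline{\mathbb{M}}^{\perp}). \tag{9.48b}$$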


It should be noted that Theorem 9.19 is actually a deterministic result. Probabilistic
conditions enter in certifying that the RSC condition holds with high probability (see Section 9.8),
and in verifying that, for a concrete choice of regularization parameter, the dual norm bound
$\lambda_n \ge 2\Phi^*(\nabla \mathcal{L}_n(\theta^*))$ defining the event $\mathbb{G}(\lambda_n)$ holds with high probability. The dual norm
bound cannot be explicitly verified, since it presumes knowledge of $\theta^*$, but it suffices to give
choices of $\lambda_n$ for which it holds with high probability. We illustrate such choices in various
examples to follow.

Equations (9.48a) and (9.48b) actually specify a family of upper bounds, one for each
subspace pair $(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$ over which the regularizer $\Phi$ decomposes. The optimal choice of these
subspaces serves to trade off the estimation and approximation error terms in the bound. The
upper bound (9.48b) corresponds to an oracle inequality, since it applies to any parameter $\theta^*$,
and gives a family of upper bounds involving two sources of error. The term labeled
“estimation error” represents the statistical cost of estimating a parameter belonging to the subspace
$\mathbb{M} \subseteq \overline{\mathbb{M}}$; naturally, it increases as $\overline{\mathbb{M}}$ grows. The second quantity represents “approximation
error” incurred by estimating only within the subspace $\mathbb{M}$, and it shrinks as $\mathbb{M}$ is increased.
Thus, the optimal bound is obtained by choosing the model subspace to balance these two
types of error. We illustrate such choices in various examples to follow.
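
The trade-off between the two terms in (9.47) can be made concrete for the $\ell_1$-norm with Euclidean error norm, taking $\mathbb{M} = \overline{\mathbb{M}}$ to be the vectors supported on the $s$ largest entries of $\theta^*$, so that $\Psi^2(\overline{\mathbb{M}}) = s$ and $\Phi(\theta^*_{\mathbb{M}^{\perp}})$ is the $\ell_1$-norm of the remaining entries. The sketch below is purely illustrative: the weakly sparse target, and the values of $\lambda_n$, $\kappa$ and $\tau_n^2$, are hypothetical choices of ours, and the side conditions of part (b) of the theorem are not checked.

```python
import numpy as np

def eps_sq_terms(theta_star, s, lam, kappa, tau_sq):
    """Estimation and approximation terms of (9.47) for Phi = l1 and the top-s support."""
    order = np.argsort(-np.abs(theta_star))             # coordinates sorted by magnitude
    tail_l1 = np.abs(theta_star[order[s:]]).sum()       # Phi(theta*_{M-perp})
    estimation = 9.0 * lam**2 / kappa**2 * s            # uses Psi^2(M-bar) = s
    approximation = 8.0 / kappa * (lam * tail_l1 + 16.0 * tau_sq * tail_l1**2)
    return estimation, approximation

d, n = 500, 2000
theta_star = 1.0 / np.arange(1, d + 1) ** 2             # hypothetical target with polynomial decay
lam = np.sqrt(np.log(d) / n)                            # illustrative regularization level
kappa, tau_sq = 1.0, np.log(d) / n                      # illustrative RSC parameters

totals = [sum(eps_sq_terms(theta_star, s, lam, kappa, tau_sq)) for s in range(d + 1)]
best_s = int(np.argmin(totals))
print(best_s, totals[best_s], totals[0], totals[d])
```

In this sketch the best bound is obtained at an intermediate model size: very small $s$ leaves a large approximation term, while very large $s$ inflates the estimation term.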


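In the special case that the target parameter $\theta^*$ is contained within a subspace $\mathbb{M}$,
Theorem 9.19 has the following corollary:


Corollary 9.20 Suppose that, in addition to the conditions of Theorem 9.19, the optimal
parameter $\theta^*$ belongs to $\mathbb{M}$. Then any optimal solution $\widehat{\theta}$ to the optimization
problem (9.3) satisfies the bounds

$$\Phi(\widehat{\theta} - \theta^*) \le 6 \frac{\lambda_n}{\kappa} \Psi^2(\overline{\mathbb{M}}), \tag{9.49a}$$

$$\|\widehat{\theta} - \theta^*\|^2 \le 9 \frac{\lambda_n^2}{\kappa^2} \Psi^2(\overline{\mathbb{M}}). \tag{9.49b}$$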

This corollary can be applied directly to obtain concrete estimation error bounds for many 
problems, as we illustrate in the sequel. 


We now turn to the proof of Theorem 9.19. 


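Proof We begin by proving part (a). Letting $\widehat{\Delta} = \widehat{\theta} - \theta^*$ be the error, by the triangle
inequality, we have

$$\Phi(\widehat{\Delta}) \le \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) + \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}})$$
$$\overset{(i)}{\le} \Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) + \big\{ 3\Phi(\widehat{\Delta}_{\overline{\mathbb{M}}}) + 4\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\}$$
$$\overset{(ii)}{\le} 4 \big\{ \Psi(\overline{\mathbb{M}}) \|\widehat{\theta} - \theta^*\| + \Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\},$$

where inequality (i) follows from Proposition 9.13 under event $\mathbb{G}(\lambda_n)$ and inequality (ii)
follows from the definition of the optimal subspace constant.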


Turning to the proof of part (b), in order to simplify notation, we adopt the shorthand $\mathbb{C}$
for the set $\mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^{\perp})$. Letting $\delta \in (0, R]$ be a given error radius to be chosen, the following
lemma shows that it suffices to control the sign of the function $F$ from equation (9.31) over
the set $\mathbb{K}(\delta) := \mathbb{C} \cap \{ \|\Delta\| = \delta \}$.


Lemma 9.21 If $F(\Delta) > 0$ for all vectors $\Delta \in \mathbb{K}(\delta)$, then $\|\widehat{\Delta}\| \le \delta$.


Proof We prove the contrapositive statement: in particular, we show that if for some
optimal solution $\widehat{\theta}$, the associated error vector $\widehat{\Delta} = \widehat{\theta} - \theta^*$ satisfies the inequality $\|\widehat{\Delta}\| > \delta$,
then there must be some vector $\widetilde{\Delta} \in \mathbb{K}(\delta)$ such that $F(\widetilde{\Delta}) \le 0$. If $\|\widehat{\Delta}\| > \delta$, then since $\mathbb{C}$ is
star-shaped around the origin (see the Appendix, Section 9.9), the line joining $\widehat{\Delta}$ to $0$ must
intersect the set $\mathbb{K}(\delta)$ at some intermediate point of the form $t^* \widehat{\Delta}$ for some $t^* \in [0, 1]$. See
Figure 9.9 for an illustration.


Figure 9.9 Geometry of the proof of Lemma 9.21. When $\|\widehat{\Delta}\| > \delta$ and the set $\mathbb{C}$
is star-shaped around the origin, any line joining $\widehat{\Delta}$ and the origin $0$ must intersect
the set $\mathbb{K}(\delta) = \{ \|\Delta\| = \delta \} \cap \mathbb{C}$ at some intermediate point of the form $t^* \widehat{\Delta}$ for some
$t^* \in [0, 1]$.


Since the cost function $\mathcal{L}_n$ and regularizer $\Phi$ are convex, the function $F$ is also convex
for any non-negative choice of the regularization parameter. Given the convexity of $F$, we
can apply Jensen’s inequality so as to obtain

$$F(t^* \widehat{\Delta}) = F\big(t^* \widehat{\Delta} + (1 - t^*) 0\big) \le t^* F(\widehat{\Delta}) + (1 - t^*) F(0) \overset{(i)}{=} t^* F(\widehat{\Delta}),$$

where equality (i) uses the fact that $F(0) = 0$ by construction. But since $\widehat{\Delta}$ is optimal, we
must have $F(\widehat{\Delta}) \le 0$, and hence $F(t^* \widehat{\Delta}) \le 0$ as well. Thus, we have constructed a vector
$\widetilde{\Delta} = t^* \widehat{\Delta}$ with the claimed properties, thereby establishing the claim in the lemma.


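We now return to the proof of Theorem 9.19. Fix some radius $\delta \in (0, R]$, whose value will
be specified later in the proof (see equation (9.53)). On the basis of Lemma 9.21, the proof
of Theorem 9.19 will be complete if we can establish a lower bound on the function value
$F(\Delta)$ for all vectors $\Delta \in \mathbb{K}(\delta)$. For an arbitrary vector $\Delta \in \mathbb{K}(\delta)$, we have

$$F(\Delta) = \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) + \lambda_n \{ \Phi(\theta^* + \Delta) - \Phi(\theta^*) \}$$
$$\overset{(i)}{\ge} \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle + \frac{\kappa}{2} \|\Delta\|^2 - \tau_n^2 \Phi^2(\Delta) + \lambda_n \{ \Phi(\theta^* + \Delta) - \Phi(\theta^*) \} \tag{9.50}$$
$$\overset{(ii)}{\ge} \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle + \frac{\kappa}{2} \|\Delta\|^2 - \tau_n^2 \Phi^2(\Delta) + \lambda_n \big\{ \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}) - 2\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\},$$

where inequality (i) follows from the RSC condition, and inequality (ii) follows from the
bound (9.32).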
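By applying Hölder’s inequality with the regularizer $\Phi$ and its dual $\Phi^*$, we find that

$$|\langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle| \le \Phi^*(\nabla \mathcal{L}_n(\theta^*)) \, \Phi(\Delta).$$

Under the event $\mathbb{G}(\lambda_n)$, the regularization parameter is lower bounded as $\lambda_n \ge 2\Phi^*(\nabla \mathcal{L}_n(\theta^*))$,
which implies that $|\langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle| \le \frac{\lambda_n}{2} \Phi(\Delta)$. Consequently, we have

$$F(\Delta) \ge \frac{\kappa}{2} \|\Delta\|^2 - \tau_n^2 \Phi^2(\Delta) + \lambda_n \big\{ \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \Phi(\Delta_{\overline{\mathbb{M}}}) - 2\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\} - \frac{\lambda_n}{2} \Phi(\Delta).$$

The triangle inequality implies that

$$\Phi(\Delta) = \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}} + \Delta_{\overline{\mathbb{M}}}) \le \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) + \Phi(\Delta_{\overline{\mathbb{M}}}),$$

and hence, following some algebra, we find that

$$F(\Delta) \ge \frac{\kappa}{2} \|\Delta\|^2 - \tau_n^2 \Phi^2(\Delta) + \lambda_n \Big\{ \frac{1}{2} \Phi(\Delta_{\overline{\mathbb{M}}^{\perp}}) - \frac{3}{2} \Phi(\Delta_{\overline{\mathbb{M}}}) - 2\Phi(\theta^*_{\mathbb{M}^{\perp}}) \Big\}$$
$$\ge \frac{\kappa}{2} \|\Delta\|^2 - \tau_n^2 \Phi^2(\Delta) - \frac{\lambda_n}{2} \big\{ 3\Phi(\Delta_{\overline{\mathbb{M}}}) + 4\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\}. \tag{9.51}$$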


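Now definition (9.44) of the subspace Lipschitz constant implies that $\Phi(\Delta_{\overline{\mathbb{M}}}) \le \Psi(\overline{\mathbb{M}}) \|\Delta_{\overline{\mathbb{M}}}\|$.
Since the projection $\Delta \mapsto \Delta_{\overline{\mathbb{M}}}$ is defined in terms of the norm $\|\cdot\|$, it is non-expansive. Since
$0 \in \overline{\mathbb{M}}$, we have

$$\|\Delta_{\overline{\mathbb{M}}}\| = \|\Pi_{\overline{\mathbb{M}}}(\Delta) - \Pi_{\overline{\mathbb{M}}}(0)\| \overset{(i)}{\le} \|\Delta - 0\| = \|\Delta\|,$$

where inequality (i) uses non-expansiveness of the projection. Combining with the earlier
bound, we conclude that $\Phi(\Delta_{\overline{\mathbb{M}}}) \le \Psi(\overline{\mathbb{M}}) \|\Delta\|$.
Similarly, for any $\Delta \in \mathbb{C}$, we have

$$\Phi^2(\Delta) \le \big\{ 4\Phi(\Delta_{\overline{\mathbb{M}}}) + 4\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\}^2 \le 32 \Phi^2(\Delta_{\overline{\mathbb{M}}}) + 32 \Phi^2(\theta^*_{\mathbb{M}^{\perp}})$$
$$\le 32 \Psi^2(\overline{\mathbb{M}}) \|\Delta\|^2 + 32 \Phi^2(\theta^*_{\mathbb{M}^{\perp}}). \tag{9.52}$$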


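Substituting into the lower bound (9.51), we obtain the inequality

$$F(\Delta) \ge \Big\{ \frac{\kappa}{2} - 32 \tau_n^2 \Psi^2(\overline{\mathbb{M}}) \Big\} \|\Delta\|^2 - 32 \tau_n^2 \Phi^2(\theta^*_{\mathbb{M}^{\perp}}) - \frac{\lambda_n}{2} \big\{ 3\Psi(\overline{\mathbb{M}}) \|\Delta\| + 4\Phi(\theta^*_{\mathbb{M}^{\perp}}) \big\}$$
$$\overset{(ii)}{\ge} \frac{\kappa}{4} \|\Delta\|^2 - \frac{3\lambda_n}{2} \Psi(\overline{\mathbb{M}}) \|\Delta\| - 32 \tau_n^2 \Phi^2(\theta^*_{\mathbb{M}^{\perp}}) - 2\lambda_n \Phi(\theta^*_{\mathbb{M}^{\perp}}),$$

where step (ii) uses the assumed bound $\tau_n^2 \Psi^2(\overline{\mathbb{M}}) < \frac{\kappa}{64}$.
The right-hand side of this inequality is a strictly positive definite quadratic form in $\|\Delta\|$,
and so will be positive for $\|\Delta\|$ sufficiently large. In particular, some algebra shows that this
is the case as long as

$$\|\Delta\|^2 \ge \varepsilon_n^2(\mathbb{M}, \overline{\mathbb{M}}^{\perp}) := 9 \frac{\lambda_n^2}{\kappa^2} \Psi^2(\overline{\mathbb{M}}) + \frac{8}{\kappa} \Big\{ \lambda_n \Phi(\theta^*_{\mathbb{M}^{\perp}}) + 16 \tau_n^2 \Phi^2(\theta^*_{\mathbb{M}^{\perp}}) \Big\}. \tag{9.53}$$

This argument is valid as long as $\varepsilon_n \le R$, as assumed in the statement.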


9.4.2 Bounds under $\Phi^*$-curvature


We now turn to an alternative form of restricted curvature, one which involves a lower bound 
on the gradient of the cost function. In order to motivate the definition to follow, note that 
an alternative way of characterizing strong convexity of a differentiable cost function is via 
the behavior of its gradient. More precisely, a differentiable function $\mathcal{L}_n$ is locally $\kappa$-strongly
convex at $\theta^*$, in the sense of the earlier definition (9.37), if and only if

$$\langle \nabla \mathcal{L}_n(\theta^* + \Delta) - \nabla \mathcal{L}_n(\theta^*), \Delta \rangle \ge \kappa \|\Delta\|^2 \tag{9.54}$$

for all $\Delta$ in some ball around zero. See Exercise 9.9 for verification of the equivalence
between the property (9.54) and the earlier definition (9.37). When the underlying norm $\|\cdot\|$
is the $\ell_2$-norm, then the condition (9.54), combined with the Cauchy-Schwarz inequality,
implies that

$$\|\nabla \mathcal{L}_n(\theta^* + \Delta) - \nabla \mathcal{L}_n(\theta^*)\|_2 \ge \kappa \|\Delta\|_2.$$

This implication suggests that it could be useful to consider alternative notions of curvature
based on different choices of the norm. Here we consider such a notion based on the dual
norm $\Phi^*$:


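Definition 9.22 The cost function satisfies a $\Phi^*$-norm curvature condition with
curvature $\kappa$, tolerance $\tau_n$ and radius $R$ if

$$\Phi^*\big( \nabla \mathcal{L}_n(\theta^* + \Delta) - \nabla \mathcal{L}_n(\theta^*) \big) \ge \kappa \Phi^*(\Delta) - \tau_n \Phi(\Delta) \tag{9.55}$$

for all $\Delta \in \mathbb{B}_{\Phi^*}(R) := \{ \theta \in \Omega \mid \Phi^*(\theta) \le R \}$.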


As with restricted strong convexity, this definition is most easily understood in application
to the classical case of least-squares cost and $\ell_1$-regularization:


Example 9.23 (Restricted curvature for least-squares cost) For the least-squares cost function,
we have $\nabla\mathcal{L}_n(\theta) = \frac{1}{n}X^{T}X(\theta - \theta^*) = \widehat{\Sigma}(\theta - \theta^*)$, where $\widehat{\Sigma} = \frac{1}{n}X^{T}X$ is the sample covariance
matrix. For the $\ell_1$-norm as the regularizer $\Phi$, the dual norm $\Phi^*$ is the $\ell_\infty$-norm, so that the
restricted curvature condition (9.55) is equivalent to the lower bound
\[
\|\widehat{\Sigma}\Delta\|_\infty \ge \kappa\,\|\Delta\|_\infty - \tau_n\,\|\Delta\|_1 \qquad \text{for all } \Delta \in \mathbb{R}^d. \tag{9.56}
\]


In this particular example, localization to the ball $\mathbb{B}_\infty(R)$ is actually unnecessary, since the
lower bound is invariant to rescaling of $\Delta$. The bound (9.56) is very closely related to what
are known as $\ell_\infty$-restricted eigenvalues of the sample covariance matrix $\widehat{\Sigma}$. More precisely,
such conditions involve lower bounds of the form
\[
\|\widehat{\Sigma}\Delta\|_\infty \ge \kappa'\,\|\Delta\|_\infty \qquad \text{for all } \Delta \in \mathbb{C}(S; \alpha), \tag{9.57}
\]
where $\mathbb{C}(S; \alpha) := \{\Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_1 \le \alpha\|\Delta_S\|_1\}$, and $(\kappa', \alpha)$ are given positive constants.
In Exercise 9.11, we show that a bound of the form (9.56) implies a form of the $\ell_\infty$-RE
condition (9.57) as long as $n \gtrsim |S|^2 \log d$. Moreover, as we show in Exercise 7.13, such an
$\ell_\infty$-RE condition can be used to derive bounds on the $\ell_\infty$-error of the Lasso.

Finally, as with $\ell_2$-restricted eigenvalue conditions (recall Example 9.16), a lower bound
of the form (9.56) holds with high probability with constant $\kappa$ and tolerance $\tau_n \asymp \sqrt{\frac{\log d}{n}}$
for various types of random design matrices; Exercise 7.14 provides details on one such
result. ♣
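The bound (9.56) and its scale invariance can be checked numerically on a toy design. The following is a minimal sketch, not part of the original development: the helper name `linf_curvature_gap`, the orthonormal design and the constants `kappa=1.0`, `tau=0.1` are all illustrative assumptions.

```python
import numpy as np

def linf_curvature_gap(X, delta, kappa, tau):
    """Left side minus right side of the l_inf-curvature bound (9.56).

    A non-negative return value means the bound holds for this delta.
    (Hypothetical helper, introduced only for illustration.)
    """
    n = X.shape[0]
    sigma_hat = X.T @ X / n                          # sample covariance matrix
    lhs = np.max(np.abs(sigma_hat @ delta))          # ||Sigma_hat Delta||_inf
    rhs = kappa * np.max(np.abs(delta)) - tau * np.sum(np.abs(delta))
    return lhs - rhs

# Toy design (assumption): X = sqrt(n) * I with n = d, so Sigma_hat = I.
d = 5
X = np.sqrt(d) * np.eye(d)
delta = np.array([1.0, -2.0, 0.0, 0.5, 0.0])

gap = linf_curvature_gap(X, delta, kappa=1.0, tau=0.1)
# Both sides of (9.56) are positively homogeneous, so rescaling delta
# rescales the gap by the same factor.
gap_scaled = linf_curvature_gap(X, 3.0 * delta, kappa=1.0, tau=0.1)
```

Because both sides of (9.56) scale linearly in the vector, the sign of the gap does not depend on its length, which is why no localization radius is needed in the least-squares case.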


With this definition in place, we are ready to state the assumptions underlying the main result
of this section:


(A1′) The cost satisfies the $\Phi^*$-curvature condition (9.55) with parameters $(\kappa, \tau_n; R)$.


(A2) The regularizer is decomposable with respect to the subspace pair $(\mathbb{M}, \overline{\mathbb{M}}^\perp)$ with $\mathbb{M} \subseteq \overline{\mathbb{M}}$.


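Under these conditions, we have the following:


Theorem 9.24 Given a target parameter $\theta^* \in \mathbb{M}$, consider the regularized M-estimator
(9.3) under conditions (A1′) and (A2), and suppose that $\tau_n\Psi^2(\overline{\mathbb{M}}) < \frac{\kappa}{32}$. Conditioned on
the event $\mathbb{G}(\lambda_n) \cap \{\Phi^*(\widehat{\theta} - \theta^*) \le R\}$, any optimal solution $\widehat{\theta}$ satisfies the bound
\[
\Phi^*(\widehat{\theta} - \theta^*) \le 3\,\frac{\lambda_n}{\kappa}. \tag{9.58}
\]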


Like Theorem 9.19, this claim is deterministic given the stated conditioning. Probabilistic
claims enter in certifying that the “good” event $\mathbb{G}(\lambda_n)$ holds with high probability with a
specified choice of $\lambda_n$. Moreover, except for the special case of least squares, we need to use
related results (such as those in Theorem 9.19) to certify that $\Phi^*(\widehat{\theta} - \theta^*) \le R$, before applying
this result.


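Proof The proof is relatively straightforward given our development thus far. By standard
optimality conditions for a convex program, for any optimum $\widehat{\theta}$, there must exist a subgradient
vector $\widehat{z} \in \partial\Phi(\widehat{\theta})$ such that $\nabla\mathcal{L}_n(\widehat{\theta}) + \lambda_n\widehat{z} = 0$. Introducing the error vector $\widehat{\Delta} := \widehat{\theta} - \theta^*$,
some algebra yields
\[
\nabla\mathcal{L}_n(\theta^* + \widehat{\Delta}) - \nabla\mathcal{L}_n(\theta^*) = -\nabla\mathcal{L}_n(\theta^*) - \lambda_n\widehat{z}.
\]
Taking the $\Phi^*$-norm of both sides and applying the triangle inequality yields
\[
\Phi^*\big(\nabla\mathcal{L}_n(\theta^* + \widehat{\Delta}) - \nabla\mathcal{L}_n(\theta^*)\big) \le \Phi^*\big(\nabla\mathcal{L}_n(\theta^*)\big) + \lambda_n\Phi^*(\widehat{z}).
\]
On one hand, on the event $\mathbb{G}(\lambda_n)$, we have that $\Phi^*(\nabla\mathcal{L}_n(\theta^*)) \le \lambda_n/2$, whereas, on the
other hand, Exercise 9.6 implies that $\Phi^*(\widehat{z}) \le 1$. Putting together the pieces, we find that
$\Phi^*\big(\nabla\mathcal{L}_n(\theta^* + \widehat{\Delta}) - \nabla\mathcal{L}_n(\theta^*)\big) \le \frac{3\lambda_n}{2}$. Finally, applying the curvature condition (9.55), we
obtain
\[
\kappa\,\Phi^*(\widehat{\Delta}) \le \frac{3}{2}\lambda_n + \tau_n\,\Phi(\widehat{\Delta}). \tag{9.59}
\]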


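It remains to bound $\Phi(\widehat{\Delta})$ in terms of the dual norm $\Phi^*(\widehat{\Delta})$. Since this result is useful in other
contexts, we state it as a separate lemma here:


Lemma 9.25 If $\theta^* \in \mathbb{M}$, then
\[
\Phi(\Delta) \le 16\,\Psi^2(\overline{\mathbb{M}})\,\Phi^*(\Delta) \qquad \text{for any } \Delta \in \mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^\perp). \tag{9.60}
\]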


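Before returning to prove this lemma, we use it to complete the proof of the theorem. On
the event $\mathbb{G}(\lambda_n)$, Proposition 9.13 may be applied to guarantee that $\widehat{\Delta} \in \mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^\perp)$. Consequently,
the bound (9.60) applies to $\widehat{\Delta}$. Substituting into the earlier bound (9.59), we find
that $\big(\kappa - 16\Psi^2(\overline{\mathbb{M}})\tau_n\big)\,\Phi^*(\widehat{\Delta}) \le \frac{3}{2}\lambda_n$, from which the claim follows by the assumption that
$\Psi^2(\overline{\mathbb{M}})\tau_n < \frac{\kappa}{32}$.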


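We now return to prove Lemma 9.25. From our earlier calculation (9.45), whenever $\theta^* \in \mathbb{M}$
and $\Delta \in \mathbb{C}_{\theta^*}(\mathbb{M}, \overline{\mathbb{M}}^\perp)$, then $\Phi(\Delta) \le 4\Psi(\overline{\mathbb{M}})\,\|\Delta\|$. Moreover, by Hölder’s inequality, we have
\[
\|\Delta\|^2 \le \Phi(\Delta)\,\Phi^*(\Delta) \le 4\Psi(\overline{\mathbb{M}})\,\|\Delta\|\,\Phi^*(\Delta),
\]
whence $\|\Delta\| \le 4\Psi(\overline{\mathbb{M}})\,\Phi^*(\Delta)$. Putting together the pieces, we have
\[
\Phi(\Delta) \le 4\Psi(\overline{\mathbb{M}})\,\|\Delta\| \le 16\,\Psi^2(\overline{\mathbb{M}})\,\Phi^*(\Delta),
\]
as claimed. This completes the proof of the lemma, and hence of the theorem.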
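For the $\ell_1$-norm with support set $S$ of size $s$, the lemma reads $\|\Delta\|_1 \le 16\,s\,\|\Delta\|_\infty$ on the cone $\{\|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}$. The sketch below samples random vectors and checks this specialization; the function names, the cone constant 3 and the sampling scheme are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def in_l1_cone(delta, support, alpha=3.0):
    """True if ||delta_{S^c}||_1 <= alpha * ||delta_S||_1 (hypothetical helper)."""
    mask = np.zeros(delta.size, dtype=bool)
    mask[support] = True
    return np.sum(np.abs(delta[~mask])) <= alpha * np.sum(np.abs(delta[mask]))

def lemma_bound_holds(delta, support):
    """Check ||delta||_1 <= 16 * s * ||delta||_inf, the l_1 case of (9.60)."""
    s = len(support)
    return np.sum(np.abs(delta)) <= 16 * s * np.max(np.abs(delta))

rng = np.random.default_rng(0)
d, support = 50, [0, 1, 2]
checked, violations = 0, 0
for _ in range(1000):
    delta = rng.standard_normal(d)
    delta[3:] *= 0.05            # shrink off-support entries so most draws land in the cone
    if in_l1_cone(delta, support):
        checked += 1
        if not lemma_bound_holds(delta, support):
            violations += 1
```

No violations should occur: on the cone, $\|\Delta\|_1 \le 4\|\Delta_S\|_1 \le 4s\|\Delta\|_\infty$, which is even tighter than the constant 16 in the lemma.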


Thus far, we have derived two general bounds on the error $\widehat{\theta} - \theta^*$ associated with optima
of the M-estimator (9.3). In the remaining sections, we specialize these general results to
particular classes of statistical models.


9.5 Bounds for sparse vector regression 


We now turn to some consequences of our general theory for the problem of sparse regres- 
sion. In developing the theory for the full class of generalized linear models, this section 
provides an alternative and more general complement to our discussion of the sparse linear 
model in Chapter 7. 


9.5.1 Generalized linear models with sparsity 


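All results in the following two sections are applicable to samples of the form $\{(x_i, y_i)\}_{i=1}^n$ where:


(G1) The covariates are $C$-column normalized: $\max_{j=1,\ldots,d} \sqrt{\frac{1}{n}\sum_{i=1}^n x_{ij}^2} \le C$.


(G2) Conditionally on $x_i$, each response $y_i$ is drawn i.i.d. according to a conditional distribution
of the form
\[
\mathbb{P}_{\theta^*}(y \mid x) \propto \exp\Big\{\frac{y\,\langle x, \theta^*\rangle - \psi(\langle x, \theta^*\rangle)}{c(\sigma)}\Big\},
\]
where the partition function $\psi$ has a bounded second derivative ($\|\psi''\|_\infty \le B^2$).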


We analyze the $\ell_1$-regularized version of the GLM log-likelihood estimator, namely
\[
\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \underbrace{\frac{1}{n}\sum_{i=1}^n \big\{\psi(\langle x_i, \theta\rangle) - y_i\,\langle x_i, \theta\rangle\big\}}_{\mathcal{L}_n(\theta)} + \lambda_n\|\theta\|_1 \Big\}. \tag{9.61}
\]
For short, we refer to this M-estimator as the GLM Lasso. Note that the usual linear model
description $y_i = \langle x_i, \theta^*\rangle + w_i$ with $w_i \sim N(0, \sigma^2)$ falls into this class with $B = 1$, in which
case the estimator (9.61) is equivalent to the ordinary Lasso. It also includes as special
cases the problems of logistic regression and multinomial regression, but excludes the case
of Poisson regression, due to the boundedness condition (G2).
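To make the estimator (9.61) concrete, here is a minimal sketch of one way to compute a GLM Lasso solution for logistic regression, using proximal gradient descent (ISTA). The solver choice, the function names, the synthetic data and the value `lam=0.05` are assumptions made for illustration; the text itself does not prescribe an algorithm. For the logistic model, $\psi(t) = \log(1 + e^t)$ with responses in $\{0, 1\}$ and $\|\psi''\|_\infty \le 1/4$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psi_logistic(u):
    """Partition function psi(u) = log(1 + e^u), computed stably."""
    return np.logaddexp(0.0, u)

def psi_prime_logistic(u):
    """psi'(u) = 1 / (1 + e^{-u}), written via tanh for numerical stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * u))

def glm_lasso_objective(theta, X, y, lam, psi=psi_logistic):
    """Objective of (9.61): mean of psi(<x_i,theta>) - y_i <x_i,theta>, plus lam * ||theta||_1."""
    u = X @ theta
    return np.mean(psi(u) - y * u) + lam * np.sum(np.abs(theta))

def glm_lasso_ista(X, y, lam, psi_prime=psi_prime_logistic, B2=0.25, n_iter=300):
    """Proximal gradient sketch; B2 is an upper bound on psi'' (1/4 for logistic)."""
    n, d = X.shape
    # The Hessian of L_n is bounded by B2 * ||X||_op^2 / n, giving step size 1/L.
    L = B2 * np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (psi_prime(X @ theta) - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Synthetic data (assumption): sparse theta_star, Gaussian covariates.
rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:3] = [1.5, -2.0, 1.0]
y = (rng.random(n) < psi_prime_logistic(X @ theta_star)).astype(float)

theta_hat = glm_lasso_ista(X, y, lam=0.05)
obj_start = glm_lasso_objective(np.zeros(d), X, y, lam=0.05)
obj_hat = glm_lasso_objective(theta_hat, X, y, lam=0.05)
```

With step size $1/L$ the iteration is a descent method, so the objective at `theta_hat` is no larger than at the zero starting point. Swapping in $\psi(t) = t^2/2$ and `B2=1.0` would give the ordinary Lasso.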


9.5.2 Bounds under restricted strong convexity 
We begin by proving bounds when the Taylor-series error around $\theta^*$ associated with the
negative log-likelihood (9.61) satisfies the RSC condition
\[
\mathcal{E}_n(\Delta) \ge \frac{\kappa}{2}\|\Delta\|_2^2 - c_1\frac{\log d}{n}\|\Delta\|_1^2 \qquad \text{for all } \|\Delta\|_2 \le 1. \tag{9.62}
\]
As discussed in Example 9.17, when the covariates $\{x_i\}_{i=1}^n$ are drawn from a zero-mean sub-Gaussian
distribution, a bound of this form holds with high probability for any GLM.


The following result applies to any solution $\widehat{\theta}$ of the GLM Lasso (9.61) with regularization
parameter $\lambda_n = 4BC\big(\sqrt{\frac{\log d}{n}} + \delta\big)$ for some $\delta \in (0, 1)$.


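Corollary 9.26 Consider a GLM satisfying conditions (G1) and (G2), the RSC condition (9.62),
and suppose the true regression vector $\theta^*$ is supported on a subset $S$ of cardinality
$s$. Given a sample size $n$ large enough to ensure that $s\big\{\lambda_n^2 + \frac{\log d}{n}\big\} < \min\big\{\frac{4\kappa^2}{9}, \frac{\kappa}{64c_1}\big\}$,
any GLM Lasso solution $\widehat{\theta}$ satisfies the bounds
\[
\|\widehat{\theta} - \theta^*\|_2^2 \le \frac{9}{4}\,\frac{s\lambda_n^2}{\kappa^2} \qquad \text{and} \qquad \|\widehat{\theta} - \theta^*\|_1 \le \frac{6}{\kappa}\,s\lambda_n, \tag{9.63}
\]
both with probability at least $1 - 2e^{-2n\delta^2}$.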


We have already proved results of this form in Chapter 7 for the special case of the linear 
model; the proof here illustrates the application of our more general techniques. 
Proof Both results follow via an application of Corollary 9.20 with the subspaces
\[
\mathbb{M}(S) = \overline{\mathbb{M}}(S) = \{\theta \in \mathbb{R}^d \mid \theta_j = 0 \text{ for all } j \notin S\}.
\]
With this choice, note that we have $\Psi^2(\mathbb{M}) = s$; moreover, the assumed RSC condition (9.62)
is a special case of our general definition with $\tau_n^2 = c_1\frac{\log d}{n}$. In order to apply Corollary 9.20,
we need to ensure that $\tau_n^2\Psi^2(\mathbb{M}) < \frac{\kappa}{64}$, and since the local RSC holds over a ball with radius
$R = 1$, we also need to ensure that $\frac{9}{4}\frac{s\lambda_n^2}{\kappa^2} < 1$. Both of these conditions are guaranteed by
our assumed lower bound on the sample size.

The only remaining step is to verify that the good event $\mathbb{G}(\lambda_n)$ holds with the probability
stated in Corollary 9.26. Given the form (9.61) of the GLM log-likelihood, we can write the
score function as the i.i.d. sum $\nabla\mathcal{L}_n(\theta^*) = \frac{1}{n}\sum_{i=1}^n V_i$, where $V_i \in \mathbb{R}^d$ is a zero-mean random
vector with components
\[
V_{ij} = \big\{\psi'(\langle x_i, \theta^*\rangle) - y_i\big\}\,x_{ij}.
\]


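Let us upper bound the moment generating function of these variables. For any $t \in \mathbb{R}$, we
have
\[
\begin{aligned}
\log \mathbb{E}[e^{-tV_{ij}}] &= \log \mathbb{E}[e^{t x_{ij} y_i}] - t x_{ij}\,\psi'(\langle x_i, \theta^*\rangle) \\
&= \psi\big(t x_{ij} + \langle x_i, \theta^*\rangle\big) - \psi\big(\langle x_i, \theta^*\rangle\big) - t x_{ij}\,\psi'\big(\langle x_i, \theta^*\rangle\big).
\end{aligned}
\]
By a Taylor-series expansion, there is some intermediate $\tilde{t}$ such that
\[
\log \mathbb{E}[e^{-tV_{ij}}] = \frac{1}{2}t^2 x_{ij}^2\,\psi''\big(\tilde{t}\,x_{ij} + \langle x_i, \theta^*\rangle\big) \le \frac{B^2 t^2 x_{ij}^2}{2},
\]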

where the final inequality follows from the boundedness condition (G2). Using independence
of the samples, we have
\[
\frac{1}{n}\log \mathbb{E}\Big[e^{-t\sum_{i=1}^n V_{ij}}\Big] \le \frac{t^2B^2}{2}\Big(\frac{1}{n}\sum_{i=1}^n x_{ij}^2\Big) \le \frac{t^2B^2C^2}{2},
\]
where the final step uses the column normalization (G1) on the columns of the design matrix
$X$. Since this bound holds for any $t \in \mathbb{R}$, we have shown that each element of the score
function $\nabla\mathcal{L}_n(\theta^*) \in \mathbb{R}^d$ is zero-mean and sub-Gaussian with parameter at most $BC/\sqrt{n}$.
Thus, sub-Gaussian tail bounds combined with the union bound guarantee that
\[
\mathbb{P}\big[\|\nabla\mathcal{L}_n(\theta^*)\|_\infty \ge t\big] \le 2\exp\Big(-\frac{nt^2}{2B^2C^2} + \log d\Big).
\]
Setting $t = 2BC\big(\sqrt{\frac{\log d}{n}} + \delta\big)$ completes the proof.
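The final step rests on the arithmetic $\frac{nt^2}{2B^2C^2} = 2n\big(\sqrt{\log d / n} + \delta\big)^2 \ge 2\log d + 2n\delta^2$, so the tail bound is at most $2e^{-2n\delta^2}$. The following sketch checks the event $\{\|\nabla\mathcal{L}_n(\theta^*)\|_\infty \le \lambda_n/2\}$ by simulation in the linear Gaussian model, where $B = 1$ and the score equals $-X^{T}w/n$. The dimensions, seed, value of $\delta$ and helper names are illustrative assumptions.

```python
import numpy as np

def lasso_lambda(n, d, B, C, delta):
    """Regularization level lambda_n = 4 B C (sqrt(log d / n) + delta)."""
    return 4.0 * B * C * (np.sqrt(np.log(d) / n) + delta)

def score_sup_norm_linear(X, w):
    """||grad L_n(theta*)||_inf for the linear model, where the score is -X^T w / n."""
    return np.max(np.abs(X.T @ w)) / X.shape[0]

rng = np.random.default_rng(1)
n, d, delta = 200, 500, 0.2
X = rng.standard_normal((n, d))
# Column-normalization constant from (G1), computed from the realized design.
C = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(n)
lam = lasso_lambda(n, d, B=1.0, C=C, delta=delta)

# Redraw unit-variance Gaussian noise and count how often the good event holds.
trials = 50
hits = sum(
    score_sup_norm_linear(X, rng.standard_normal(n)) <= lam / 2
    for _ in range(trials)
)
```

With these values the threshold $\lambda_n/2$ sits roughly ten standard deviations above zero for every coordinate of the score, so the event should hold in every trial.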


9.5.3 Bounds under ℓ∞-curvature conditions


The preceding results were devoted to error bounds in terms of quadratic-type norms, such
as the Euclidean vector and Frobenius matrix norms. On the other hand, Theorem 9.24
provides bounds in terms of the dual norm $\Phi^*$, that is, in terms of the $\ell_\infty$-norm in the case
of $\ell_1$-regularization. We now turn to exploration of such bounds in the case of generalized
linear models. As we discuss, $\ell_\infty$-bounds also lead to bounds in terms of the $\ell_2$- and $\ell_1$-norms,
so that the resulting guarantees are in some sense stronger.

Recall that Theorem 9.24 is based on a restricted curvature condition (9.55). In the earlier
Example 9.23, we discussed the specialization of this condition to the least-squares cost,
and in Exercise 9.14, we work through the proof of an analogous result for generalized
linear models with bounded cumulant generating functions ($\|\psi''\|_\infty \le B$). More precisely,
when the population cost satisfies an $\ell_\infty$-curvature condition over the ball $\mathbb{B}_2(R)$, and the
covariates are i.i.d. and sub-Gaussian with parameter $C$, then the GLM log-likelihood $\mathcal{L}_n$
from equation (9.61) satisfies a bound of the form
\[
\|\nabla\mathcal{L}_n(\theta^* + \Delta) - \nabla\mathcal{L}_n(\theta^*)\|_\infty \ge \kappa\,\|\Delta\|_\infty - \frac{c_0}{32}\sqrt{\frac{\log d}{n}}\,\|\Delta\|_1, \tag{9.64}
\]
uniformly over $\mathbb{B}_\infty(1)$. Here $c_0$ is a constant that depends only on the parameters $(B, C)$.


Corollary 9.27 In addition to the conditions of Corollary 9.26, suppose that the $\ell_\infty$-curvature
condition (9.64) holds, and that the sample size is lower bounded as $n > c_0^2\,s^2 \log d$.
Then any optimal solution $\widehat{\theta}$ to the GLM Lasso (9.61) with regularization
parameter $\lambda_n = 2BC\big(\sqrt{\frac{\log d}{n}} + \delta\big)$ satisfies
\[
\|\widehat{\theta} - \theta^*\|_\infty \le 3\,\frac{\lambda_n}{\kappa} \tag{9.65}
\]
with probability at least $1 - 2e^{-2n\delta^2}$.


Proof We prove this corollary by applying Theorem 9.24 with the familiar subspaces
\[
\mathbb{M}(S) = \overline{\mathbb{M}}(S) = \{\theta \in \mathbb{R}^d \mid \theta_{S^c} = 0\},
\]
for which we have $\Psi^2(\overline{\mathbb{M}}(S)) = s$. By assumption (9.64), the $\ell_\infty$-curvature condition holds
with tolerance $\tau_n = \frac{c_0}{32}\sqrt{\frac{\log d}{n}}$, so that the condition $\tau_n\Psi^2(\overline{\mathbb{M}}) < \frac{\kappa}{32}$ is equivalent to the lower
bound $n > c_0^2\,s^2 \log d$ on the sample size.

Since we have assumed the conditions of Corollary 9.26, we are guaranteed that the error
vector $\widehat{\Delta} = \widehat{\theta} - \theta^*$ satisfies the bound $\|\widehat{\Delta}\|_\infty \le \|\widehat{\Delta}\|_2 \le 1$ with high probability. This localization
allows us to apply the local $\ell_\infty$-curvature condition to the error vector $\widehat{\Delta} = \widehat{\theta} - \theta^*$.

Finally, as shown in the proof of Corollary 9.26, if we choose the regularization parameter
$\lambda_n = 2BC\big\{\sqrt{\frac{\log d}{n}} + \delta\big\}$, then the event $\mathbb{G}(\lambda_n)$ holds with probability at least $1 - 2e^{-2n\delta^2}$. We
have thus verified that all the conditions needed to apply Theorem 9.24 are satisfied.


The $\ell_\infty$-bound (9.65) is a stronger guarantee than our earlier bounds in terms of the $\ell_1$- and
$\ell_2$-norms. For instance, under additional conditions on the smallest non-zero absolute values
of $\theta^*$, the $\ell_\infty$-bound (9.65) can be used to construct an estimator that has variable selection
guarantees, which may not be possible with bounds in other norms. Moreover, as we explore
in Exercise 9.13, when combined with other properties of the error vector, Corollary 9.27
implies bounds on the $\ell_1$- and $\ell_2$-norm errors that are analogous to those in Corollary 9.26.


9.6 Bounds for group-structured sparsity 


We now turn to the consequences of Theorem 9.19 for estimators based on the group Lasso
penalty with non-overlapping groups, previously discussed in Example 9.3. For concreteness,
we focus on the $\ell_2$-version of the group Lasso penalty $\|\theta\|_{\mathcal{G},2} = \sum_{g \in \mathcal{G}} \|\theta_g\|_2$. As discussed
in Example 9.6, one motivation for the group Lasso penalty comes from multivariate regression
problems, in which the regression coefficients are assumed to appear on–off in a groupwise
manner. The linear multivariate regression problem from Example 9.6 is the simplest example.
In this section, we analyze the extension to generalized linear models. Accordingly, let us
consider the group GLM Lasso
\[
\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n}\sum_{i=1}^n \big\{\psi(\langle \theta, x_i\rangle) - y_i\,\langle \theta, x_i\rangle\big\} + \lambda_n \sum_{g \in \mathcal{G}} \|\theta_g\|_2 \Big\}, \tag{9.66}
\]
a family of estimators that includes the least-squares version of the group Lasso (9.14) as a
particular case.

As with our previous corollaries, we assume that the samples $\{(x_i, y_i)\}_{i=1}^n$ are drawn i.i.d.
from a generalized linear model (GLM) satisfying condition (G2). Letting $X_g \in \mathbb{R}^{n \times |g|}$ denote
the submatrix indexed by $g$, we also impose the following variant of condition (G1) on the
design:


(G1′) The covariates satisfy the group normalization condition $\max_{g \in \mathcal{G}} \frac{|||X_g|||_2}{\sqrt{n}} \le C$.


Moreover, we assume an RSC condition of the form
\[
\mathcal{E}_n(\Delta) \ge \kappa\,\|\Delta\|_2^2 - c_1\Big\{\frac{m}{n} + \frac{\log|\mathcal{G}|}{n}\Big\}\|\Delta\|_{\mathcal{G},2}^2 \qquad \text{for all } \|\Delta\|_2 \le 1, \tag{9.67}
\]
where $m$ denotes the maximum size over all groups. As shown in Example 9.17 and Theorem 9.36,
a lower bound of this form holds with high probability when the covariates $\{x_i\}_{i=1}^n$
are drawn i.i.d. from a zero-mean sub-Gaussian distribution. Our bound applies to any solution
$\widehat{\theta}$ to the group GLM Lasso (9.66) based on a regularization parameter
\[
\lambda_n = 4BC\Big\{\sqrt{\frac{m}{n}} + \sqrt{\frac{\log|\mathcal{G}|}{n}} + \delta\Big\} \qquad \text{for some } \delta \in (0, 1).
\]


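Corollary 9.28 Given $n$ i.i.d. samples from a GLM satisfying conditions (G1′), (G2) and
the RSC condition (9.67), suppose that the true regression vector $\theta^*$ has group support
$S_{\mathcal{G}}$. As long as $|S_{\mathcal{G}}|\big\{\lambda_n^2 + \frac{m}{n} + \frac{\log|\mathcal{G}|}{n}\big\} < \min\big\{\frac{4\kappa^2}{9}, \frac{\kappa}{64c_1}\big\}$, the estimate $\widehat{\theta}$ satisfies the bound
\[
\|\widehat{\theta} - \theta^*\|_2^2 \le \frac{9}{4}\,\frac{|S_{\mathcal{G}}|\,\lambda_n^2}{\kappa^2} \tag{9.68}
\]
with probability at least $1 - 2e^{-2n\delta^2}$.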
In order to gain some intuition for this corollary, it is worthwhile to consider some special
cases. The ordinary Lasso is a special case of the group Lasso, in which there are
$|\mathcal{G}| = d$ groups, each of size $m = 1$. In this case, if we use the regularization parameter
$\lambda_n = 8BC\sqrt{\frac{\log d}{n}}$, the bound (9.68) implies that
\[
\|\widehat{\theta} - \theta^*\|_2 \lesssim \frac{BC}{\kappa}\sqrt{\frac{s \log d}{n}},
\]
showing that Corollary 9.28 is a natural generalization of Corollary 9.26.
The problem of multivariate regression provides a more substantive example of the potential
gains of using the group Lasso. Throughout this example, we take the regularization
parameter $\lambda_n = 8BC\big\{\sqrt{\frac{m}{n}} + \sqrt{\frac{\log|\mathcal{G}|}{n}}\big\}$ as given.


Example 9.29 (Faster rates for multivariate regression) As previously discussed in Example 9.6,
the problem of multivariate regression is based on the linear observation model
$Y = Z\Theta^* + W$, where $\Theta^* \in \mathbb{R}^{p \times T}$ is a matrix of regression coefficients, $Y \in \mathbb{R}^{n \times T}$ is a matrix
of observations, and $W \in \mathbb{R}^{n \times T}$ is a noise matrix. A natural group structure is defined by the
rows of the regression matrix $\Theta^*$, so that we have a total of $p$ groups each of size $T$.

A naive approach would be to ignore the group sparsity, and simply apply the elementwise
$\ell_1$-norm as a regularizer to the matrix $\Theta$. This set-up corresponds to a Lasso problem with
$d = pT$ coefficients and elementwise sparsity $T|S_{\mathcal{G}}|$, so that Corollary 9.26 would guarantee
an estimation error bound of the form
\[
\|\widehat{\Theta} - \Theta^*\|_F \lesssim \sqrt{\frac{|S_{\mathcal{G}}|\,T \log(pT)}{n}}. \tag{9.69a}
\]

By contrast, if we used the group Lasso estimator, which does explicitly model the grouping
in the sparsity, then Corollary 9.28 would guarantee an error of the form
\[
\|\widehat{\Theta} - \Theta^*\|_F \lesssim \sqrt{\frac{|S_{\mathcal{G}}|\,T}{n}} + \sqrt{\frac{|S_{\mathcal{G}}| \log p}{n}}. \tag{9.69b}
\]
For $T > 1$, it can be seen that this error bound is always better than the Lasso error
bound (9.69a), showing that the group Lasso is a better estimator when $\Theta^*$ has a sparse
group structure. In Chapter 15, we will develop techniques that can be used to show that
the rate (9.69b) is the best possible for any estimator. Indeed, the two components in this
rate have a very concrete interpretation: the first corresponds to the error associated with
estimating $|S_{\mathcal{G}}|\,T$ parameters, assuming that the group structure is known. For $|S_{\mathcal{G}}| \ll p$, the
second term is proportional to $\log\binom{p}{|S_{\mathcal{G}}|}$, and corresponds to the search complexity associated
with finding the subset of $|S_{\mathcal{G}}|$ rows out of $p$ that contain non-zero coefficients. ♣
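The gap between the two rates is easy to see numerically. The sketch below evaluates the right-hand sides of (9.69a) and (9.69b) with all constants set to one; the function names and the particular values of $(p, T, |S_{\mathcal{G}}|, n)$ are illustrative assumptions.

```python
import math

def lasso_rate(s_groups, T, p, n):
    """Right-hand side of (9.69a), constants dropped."""
    return math.sqrt(s_groups * T * math.log(p * T) / n)

def group_lasso_rate(s_groups, T, p, n):
    """Right-hand side of (9.69b), constants dropped."""
    return math.sqrt(s_groups * T / n) + math.sqrt(s_groups * math.log(p) / n)

# Example regime (assumption): many rows, moderately many responses per row.
p, T, s_groups, n = 1000, 50, 10, 1000
r_lasso = lasso_rate(s_groups, T, p, n)
r_group = group_lasso_rate(s_groups, T, p, n)
ratio = r_lasso / r_group
```

In this regime the elementwise rate pays the logarithmic factor on all $|S_{\mathcal{G}}|\,T$ active entries, whereas the group rate pays it only once per active row, so `ratio` exceeds one. Since constants are dropped, the comparison is meaningful only in order of magnitude.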


We now turn to the proof of Corollary 9.28. 


Proof We apply Corollary 9.20 using the model subspace $\mathbb{M}(S_{\mathcal{G}})$ defined in equation (9.24).
From Definition 9.18 of the subspace constant with $\Phi(\theta) = \|\theta\|_{\mathcal{G},2}$, we have
\[
\Psi(\mathbb{M}(S_{\mathcal{G}})) := \sup_{\theta \in \mathbb{M}(S_{\mathcal{G}}) \setminus \{0\}} \frac{\sum_{g \in \mathcal{G}} \|\theta_g\|_2}{\|\theta\|_2} = \sqrt{|S_{\mathcal{G}}|}.
\]
The assumed RSC condition (9.67) is a special case of our general definition with the tolerance
parameter $\tau_n^2 = c_1\big\{\frac{m}{n} + \frac{\log|\mathcal{G}|}{n}\big\}$ and radius $R = 1$. In order to apply Corollary 9.20, we
need to ensure that $\tau_n^2\Psi^2(\mathbb{M}) < \frac{\kappa}{64}$, and since the local RSC holds over a ball with radius
$R = 1$, we also need to ensure that $\frac{9}{4}\frac{|S_{\mathcal{G}}|\lambda_n^2}{\kappa^2} < 1$. Both of these conditions are guaranteed by
our assumed lower bound on the sample size.

It remains to verify that, given the specified choice of regularization parameter $\lambda_n$, the
event $\mathbb{G}(\lambda_n)$ holds with high probability.


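Verifying the event $\mathbb{G}(\lambda_n)$: Using the form of the dual norm given in Table 9.1, we have
$\Phi^*(\nabla\mathcal{L}_n(\theta^*)) = \max_{g \in \mathcal{G}} \|(\nabla\mathcal{L}_n(\theta^*))_g\|_2$. Based on the form of the GLM log-likelihood, we
have $\nabla\mathcal{L}_n(\theta^*) = \frac{1}{n}\sum_{i=1}^n V_i$, where the random vector $V_i \in \mathbb{R}^d$ has components
\[
V_{ij} = \big\{\psi'(\langle x_i, \theta^*\rangle) - y_i\big\}\,x_{ij}.
\]
For each group $g$, we let $V_{i,g} \in \mathbb{R}^{|g|}$ denote the subvector indexed by elements of $g$. With this
notation, we then have
\[
\|(\nabla\mathcal{L}_n(\theta^*))_g\|_2 = \Big\|\frac{1}{n}\sum_{i=1}^n V_{i,g}\Big\|_2 = \sup_{u \in \mathbb{S}^{|g|-1}} \Big\langle u,\, \frac{1}{n}\sum_{i=1}^n V_{i,g}\Big\rangle,
\]
where $\mathbb{S}^{|g|-1}$ is the Euclidean sphere in $\mathbb{R}^{|g|}$. From Example 5.8, we can find a 1/2-covering
of $\mathbb{S}^{|g|-1}$ in the Euclidean norm, say $\{u^1, \ldots, u^N\}$, with cardinality at most $N \le 5^{|g|}$. By the
standard discretization arguments from Chapter 5, we have
\[
\|(\nabla\mathcal{L}_n(\theta^*))_g\|_2 \le 2\max_{j=1,\ldots,N} \Big\langle u^j,\, \frac{1}{n}\sum_{i=1}^n V_{i,g}\Big\rangle.
\]
Using the same proof as Corollary 9.26, the random variable $\big\langle u^j, \frac{1}{n}\sum_{i=1}^n V_{i,g}\big\rangle$ is sub-Gaussian
with parameter at most
\[
\frac{B}{\sqrt{n}}\sqrt{\frac{1}{n}\sum_{i=1}^n \langle u^j, x_{i,g}\rangle^2} \le \frac{BC}{\sqrt{n}},
\]
where the inequality follows from condition (G1′). Consequently, from the union bound and
standard sub-Gaussian tail bounds, we have
\[
\mathbb{P}\big[\|(\nabla\mathcal{L}_n(\theta^*))_g\|_2 \ge 2t\big] \le 2\exp\Big(-\frac{nt^2}{2B^2C^2} + |g|\log 5\Big).
\]
Taking the union over all $|\mathcal{G}|$ groups yields
\[
\mathbb{P}\Big[\max_{g \in \mathcal{G}} \|(\nabla\mathcal{L}_n(\theta^*))_g\|_2 \ge 2t\Big] \le 2\exp\Big(-\frac{nt^2}{2B^2C^2} + m\log 5 + \log|\mathcal{G}|\Big),
\]
where we have used the maximum group size $m$ as an upper bound on each group size $|g|$.
Setting $t^2 = \lambda_n^2$ yields the result.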
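The dual-norm pairing used in this verification (group norm $\sum_g \|\theta_g\|_2$ against $\max_g \|v_g\|_2$) can be illustrated directly. The following is a small sketch with hand-picked vectors; the function names and the example groups are assumptions chosen for readability.

```python
import numpy as np

def group_norm(theta, groups):
    """Group Lasso norm: sum over groups of the Euclidean norm of each block."""
    return sum(np.linalg.norm(theta[g]) for g in groups)

def group_dual_norm(v, groups):
    """Dual of the group norm: maximum Euclidean norm over the blocks."""
    return max(np.linalg.norm(v[g]) for g in groups)

# Three non-overlapping groups covering six coordinates (illustrative).
groups = [[0, 1], [2, 3, 4], [5]]
theta = np.array([3.0, 4.0, 0.0, 0.0, 0.0, -2.0])
v = np.array([1.0, 0.0, 2.0, 2.0, 1.0, -1.0])

primal = group_norm(theta, groups)        # 5 + 0 + 2
dual = group_dual_norm(v, groups)         # max(1, 3, 1)
inner = float(v @ theta)                  # 3 + 0 + 2
```

Hölder's inequality for this pair states that `inner <= dual * primal`, which is the mechanism by which the event $\mathbb{G}(\lambda_n)$ controls the inner product between the score and the error vector.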
9.7 Bounds for overlapping decomposition-based norms


In this section, we turn to the analysis of the more “exotic” overlapping group Lasso norm, 
as previously introduced in Example 9.4. In order to motivate this estimator, let us return to 
the problem of multivariate regression. 


Example 9.30 (Matrix decomposition in multivariate regression) Recall the problem of
linear multivariate regression from Example 9.6: it is based on the linear observation model
$Y = Z\Theta^* + W$, where $\Theta^* \in \mathbb{R}^{p \times T}$ is an unknown matrix of regression coefficients. As discussed
previously, the ordinary group Lasso is often applied in this setting, using the rows of
the regression matrix to define the underlying set of groups. When the true regression matrix
$\Theta^*$ is actually row-sparse, then we can expect the group Lasso to yield a more accurate
estimate than the usual elementwise Lasso: compare the bounds (9.69a) and (9.69b).

However, now suppose that we apply the group Lasso estimator to a problem for which
the true regression matrix $\Theta^*$ violates the row-sparsity assumption: concretely, let us suppose
that $\Theta^*$ has $s$ total non-zero entries, each contained within a row of its own. In this setting,
Corollary 9.28 guarantees a bound of the order
\[
\|\widehat{\Theta} - \Theta^*\|_F \lesssim \sqrt{\frac{sT}{n}} + \sqrt{\frac{s \log p}{n}}. \tag{9.70}
\]
However, if we were to apply the ordinary elementwise Lasso to this problem, then Corollary 9.26
would guarantee a bound of the form
\[
\|\widehat{\Theta} - \Theta^*\|_F \lesssim \sqrt{\frac{s \log(pT)}{n}}. \tag{9.71}
\]
This error bound is always smaller than the group Lasso bound (9.70), and substantially
so for large $T$. Consequently, the ordinary group Lasso has the undesirable feature of being
less statistically efficient than the ordinary Lasso in certain settings, despite its higher
computational cost.


[Figure 9.10: schematic of the additive decomposition $\Theta^* = \Omega^* + \Gamma^*$.]

Figure 9.10 Illustration of the matrix decomposition norm (9.72) for the group
Lasso applied to the matrix rows, combined with the elementwise $\ell_1$-norm. The
norm is defined by minimizing over all additive decompositions of $\Theta^*$ as the sum
of a row-sparse matrix $\Omega^*$ with an elementwise-sparse matrix $\Gamma^*$.


How do we remedy this issue? What would be desirable is an adaptive estimator, one that
achieves the ordinary Lasso rate (9.71) when the sparsity structure is elementwise, and the
group Lasso rate (9.70) when the sparsity is row-wise. To this end, let us consider decomposing
the regression matrix $\Theta^*$ as a sum $\Omega^* + \Gamma^*$, where $\Omega^*$ is a row-sparse matrix and $\Gamma^*$
is elementwise sparse, as shown in Figure 9.10. Minimizing a weighted combination of the
group Lasso and $\ell_1$-norms over all such decompositions yields the norm
\[
\Phi_\omega(\Theta) = \inf_{\Omega + \Gamma = \Theta} \Big\{ \|\Gamma\|_1 + \omega \sum_{j=1}^{p} \|\Omega_j\|_2 \Big\}, \tag{9.72}
\]
where $\Omega_j$ denotes the $j$th row of $\Omega$, which is a special case of the overlapping group Lasso (9.10).
Our analysis to follow will show that an M-estimator based on such a regularizer exhibits the
desired adaptivity. ♣


Let us return to the general setting, in which we view the parameter $\theta \in \mathbb{R}^d$ as a vector,³
and consider the more general $\ell_1$-plus-group overlap norm
\[
\Phi_\omega(\theta) := \inf_{\alpha + \beta = \theta} \big\{ \|\alpha\|_1 + \omega\|\beta\|_{\mathcal{G},2} \big\}, \tag{9.73}
\]
where $\mathcal{G}$ is a set of disjoint groups, each of size at most $m$. The overlap norm (9.72) is
a special case, where the groups are specified by the rows of the underlying matrix. For
reasons to become clear in the proof, we use the weight
\[
\omega := \frac{\sqrt{m} + \sqrt{\log|\mathcal{G}|}}{\sqrt{\log d}}. \tag{9.74}
\]


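With this set-up, the following result applies to the adaptive group GLM Lasso,
\[
\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \underbrace{\frac{1}{n}\sum_{i=1}^n \big\{\psi(\langle \theta, x_i\rangle) - y_i\,\langle \theta, x_i\rangle\big\}}_{\mathcal{L}_n(\theta)} + \lambda_n\Phi_\omega(\theta) \Big\}, \tag{9.75}
\]
for which the Taylor-series error satisfies the RSC condition
\[
\mathcal{E}_n(\Delta) \ge \frac{\kappa}{2}\|\Delta\|_2^2 - c_1\frac{\log d}{n}\,\Phi_\omega^2(\Delta) \qquad \text{for all } \|\Delta\|_2 \le 1. \tag{9.76}
\]
Again, when the covariates $\{x_i\}_{i=1}^n$ are drawn i.i.d. from a zero-mean sub-Gaussian distribution,
a bound of this form holds with high probability for any GLM (see Example 9.17 and
Exercise 9.8).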


With this set-up, the following result applies to any optimal solution $\widehat{\theta}$ of the adaptive
group GLM Lasso (9.75) with $\lambda_n = 4BC\big(\sqrt{\frac{\log d}{n}} + \delta\big)$ for some $\delta \in (0, 1)$. Moreover, it
supposes that the true regression vector can be decomposed as $\theta^* = \alpha^* + \beta^*$, where $\alpha^*$ is
$S_{\mathrm{elt}}$-sparse, and $\beta^*$ is $S_{\mathcal{G}}$-group-sparse, and with $S_{\mathcal{G}}$ disjoint from $S_{\mathrm{elt}}$.


³ The problem of multivariate regression can be thought of as a particular case of the vector model with vector
dimension $d = pT$, via the transformation $\Theta \mapsto \mathrm{vec}(\Theta) \in \mathbb{R}^{pT}$.
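Corollary 9.31 Given $n$ i.i.d. samples from a GLM satisfying conditions (G1′) and
(G2), suppose that the RSC condition (9.76) with curvature $\kappa > 0$ holds, and that
$\big\{\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\big\}^2 \big\{\lambda_n^2 + \frac{\log d}{n}\big\} < \min\big\{\frac{\kappa^2}{36}, \frac{\kappa}{64c_1}\big\}$. Then the adaptive group GLM Lasso
estimate $\widehat{\theta}$ satisfies the bounds
\[
\|\widehat{\theta} - \theta^*\|_2^2 \le \frac{36\lambda_n^2}{\kappa^2}\Big\{\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\Big\}^2 \tag{9.77}
\]
with probability at least $1 - 3e^{-8n\delta^2}$.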


Remark: The most important feature of the bound (9.77) is its adaptivity to the elementwise
versus group sparsity. This adaptivity stems from the fact that the choices of $S_{\mathcal{G}}$ and $S_{\mathrm{elt}}$ can
be optimized so as to obtain the tightest possible bound, depending on the structure of the
regression vector $\theta^*$. To be concrete, consider the bound with the choice $\lambda_n = 8BC\sqrt{\frac{\log d}{n}}$.

At one extreme, suppose that the true regression vector $\theta^* \in \mathbb{R}^d$ is purely elementwise
sparse, in that each group contains at most one non-zero entry. In this case, we can apply the
bound (9.77) with $S_{\mathcal{G}} = \emptyset$, leading to
\[
\|\widehat{\theta} - \theta^*\|_2^2 \lesssim \frac{B^2C^2}{\kappa^2}\,\frac{s \log d}{n},
\]
where $s = |S_{\mathrm{elt}}|$ denotes the sparsity of $\theta^*$. We thus recover our previous Lasso bound from
Corollary 9.26 in this special case. At the other extreme, consider a vector that is “purely”
group-sparse, in the sense that it has some subset of active groups $S_{\mathcal{G}}$, but no isolated sparse
entries. The bound (9.77) with $S_{\mathrm{elt}} = \emptyset$ then yields
\[
\|\widehat{\theta} - \theta^*\|_2^2 \lesssim \frac{B^2C^2}{\kappa^2}\Big\{\frac{m|S_{\mathcal{G}}|}{n} + \frac{|S_{\mathcal{G}}| \log d}{n}\Big\},
\]
so that, in this special case, the decomposition method obtains the group Lasso rate from
Corollary 9.28.


Let us now prove the corollary: 


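Proof In this case, we work through the details carefully, as the decomposability of the
overlap norm needs some care. Recall the function $\mathcal{F}$ from equation (9.31), and let $\widehat{\Delta} = \widehat{\theta} - \theta^*$.
Our proof is based on showing that any vector of the form $\Delta = t\widehat{\Delta}$ for some $t \in [0, 1]$
satisfies the bounds
\[
\Phi_\omega(\Delta) \le 4\Big\{\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\Big\}\|\Delta\|_2 \tag{9.78a}
\]
and
\[
\mathcal{F}(\Delta) \ge \frac{\kappa}{2}\|\Delta\|_2^2 - c_1\frac{\log d}{n}\,\Phi_\omega^2(\Delta) - \frac{3\lambda_n}{2}\Big\{\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\Big\}\|\Delta\|_2. \tag{9.78b}
\]
Let us take these bounds as given for the moment, and then return to prove them. Substituting
the bound (9.78a) into inequality (9.78b) and rearranging yields
\[
\mathcal{F}(\Delta) \ge \|\Delta\|_2\Big\{\kappa'\|\Delta\|_2 - \frac{3\lambda_n}{2}\Big(\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\Big)\Big\},
\]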
where $\kappa' := \frac{\kappa}{2} - 16c_1\frac{\log d}{n}\big(\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\big)^2$. Under the stated bound on the sample size $n$,
we have $\kappa' \ge \frac{\kappa}{4}$, so that $\mathcal{F}$ is non-negative whenever
\[
\|\Delta\|_2 \ge \frac{6\lambda_n}{\kappa}\Big(\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\Big).
\]
Finally, following through the remainder of the proof of Theorem 9.19 yields the claimed
bound (9.77).
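The arithmetic behind the last two steps can be spelled out as follows. This is a sketch that abbreviates $\rho := \sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}$ (a shorthand introduced here, not in the original) and assumes that the sample-size condition of Corollary 9.31 guarantees $\rho^2\,\frac{\log d}{n} \le \frac{\kappa}{64c_1}$.

```latex
\begin{align*}
16c_1\frac{\log d}{n}\,\rho^2 \;\le\; 16c_1\cdot\frac{\kappa}{64c_1} \;=\; \frac{\kappa}{4}
  &\quad\Longrightarrow\quad
  \kappa' \;\ge\; \frac{\kappa}{2}-\frac{\kappa}{4} \;=\; \frac{\kappa}{4}, \\
\mathcal{F}(\Delta) \;\ge\; \|\Delta\|_2\Big\{\frac{\kappa}{4}\|\Delta\|_2-\frac{3\lambda_n}{2}\,\rho\Big\} \;\ge\; 0
  &\quad\text{whenever}\quad
  \|\Delta\|_2 \;\ge\; \frac{3\lambda_n\rho/2}{\kappa/4} \;=\; \frac{6\lambda_n}{\kappa}\,\rho .
\end{align*}
```

Squaring the threshold $\frac{6\lambda_n}{\kappa}\rho$ gives the factor $\frac{36\lambda_n^2}{\kappa^2}\rho^2$, which is the form one expects for the squared-error bound.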

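Let us now return to prove the bounds (9.78a) and (9.78b). To begin, a straightforward
calculation shows that the dual norm is given by
\[
\Phi_\omega^*(v) = \max\Big\{\|v\|_\infty,\; \frac{1}{\omega}\max_{g \in \mathcal{G}} \|v_g\|_2\Big\}.
\]
Consequently, the event $\mathbb{G}(\lambda_n) := \big\{\Phi_\omega^*(\nabla\mathcal{L}_n(\theta^*)) \le \frac{\lambda_n}{2}\big\}$ is equivalent to
\[
\|\nabla\mathcal{L}_n(\theta^*)\|_\infty \le \frac{\lambda_n}{2} \qquad \text{and} \qquad \max_{g \in \mathcal{G}} \|(\nabla\mathcal{L}_n(\theta^*))_g\|_2 \le \frac{\lambda_n\omega}{2}. \tag{9.79}
\]
We assume that these conditions hold for the moment, returning to verify them at the end of
the proof.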
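The dual-norm formula can be exercised numerically through the Hölder-type inequality $\langle v, \alpha + \beta\rangle \le \Phi_\omega^*(v)\,\{\|\alpha\|_1 + \omega\|\beta\|_{\mathcal{G},2}\}$, which holds for every decomposition. The sketch below checks it on random draws; the function names, the group layout and the sampling scheme are illustrative assumptions, and the weight follows (9.74).

```python
import numpy as np

def omega_weight(m, num_groups, d):
    """Weight (9.74): (sqrt(m) + sqrt(log |G|)) / sqrt(log d)."""
    return (np.sqrt(m) + np.sqrt(np.log(num_groups))) / np.sqrt(np.log(d))

def overlap_dual_norm(v, groups, omega):
    """Dual of the l1-plus-group overlap norm: max of l_inf and scaled group maximum."""
    group_max = max(np.linalg.norm(v[g]) for g in groups)
    return max(np.max(np.abs(v)), group_max / omega)

def decomposition_value(alpha, beta, groups, omega):
    """||alpha||_1 + omega * ||beta||_{G,2}; an upper bound on the overlap norm of alpha + beta."""
    return np.sum(np.abs(alpha)) + omega * sum(np.linalg.norm(beta[g]) for g in groups)

# Three disjoint groups of size two (illustrative layout).
groups = [[0, 1], [2, 3], [4, 5]]
d, m = 6, 2
omega = omega_weight(m, len(groups), d)

rng = np.random.default_rng(2)
worst_slack = np.inf
for _ in range(500):
    v, alpha, beta = (rng.standard_normal(d) for _ in range(3))
    lhs = v @ (alpha + beta)
    rhs = overlap_dual_norm(v, groups, omega) * decomposition_value(alpha, beta, groups, omega)
    worst_slack = min(worst_slack, rhs - lhs)
```

The slack `rhs - lhs` stays non-negative on every draw, since $\langle v, \alpha\rangle \le \|v\|_\infty\|\alpha\|_1$ and $\langle v, \beta\rangle \le \max_g\|v_g\|_2\,\|\beta\|_{\mathcal{G},2}$.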

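Define $\Delta = t\widehat{\Delta}$ for some $t \in [0, 1]$. Fix some decomposition $\theta^* = \alpha^* + \beta^*$, where $\alpha^*$ is
$S_{\mathrm{elt}}$-sparse and $\beta^*$ is $S_{\mathcal{G}}$-group-sparse, and note that
\[
\Phi_\omega(\theta^*) \le \|\alpha^*\|_1 + \omega\|\beta^*\|_{\mathcal{G},2}.
\]
Similarly, let us write $\Delta = \Delta_\alpha + \Delta_\beta$ for some pair such that
\[
\Phi_\omega(\theta^* + \Delta) = \|\alpha^* + \Delta_\alpha\|_1 + \omega\|\beta^* + \Delta_\beta\|_{\mathcal{G},2}.
\]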


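Proof of inequality (9.78a): Define the function
\[
\mathcal{F}(\Delta) := \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) + \lambda_n\big\{\Phi_\omega(\theta^* + \Delta) - \Phi_\omega(\theta^*)\big\}.
\]
Consider a vector of the form $\Delta = t\widehat{\Delta}$ for some scalar $t \in [0, 1]$. Noting that $\mathcal{F}$ is convex and
minimized at $\widehat{\Delta}$, we have
\[
\mathcal{F}(\Delta) = \mathcal{F}\big(t\widehat{\Delta} + (1 - t)0\big) \le t\,\mathcal{F}(\widehat{\Delta}) + (1 - t)\,\mathcal{F}(0) \le \mathcal{F}(0).
\]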


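Recalling that $\mathcal{E}_n(\Delta) = \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle\nabla\mathcal{L}_n(\theta^*), \Delta\rangle$, some algebra then leads to the
inequality
\[
\begin{aligned}
\mathcal{E}_n(\Delta) &\le \big|\langle\nabla\mathcal{L}_n(\theta^*), \Delta\rangle\big| - \lambda_n\big\{\|\alpha^* + \Delta_\alpha\|_1 - \|\alpha^*\|_1\big\} - \lambda_n\omega\big\{\|\beta^* + \Delta_\beta\|_{\mathcal{G},2} - \|\beta^*\|_{\mathcal{G},2}\big\} \\
&\overset{\text{(i)}}{\le} \big|\langle\nabla\mathcal{L}_n(\theta^*), \Delta\rangle\big| + \lambda_n\big\{\|(\Delta_\alpha)_{S_{\mathrm{elt}}}\|_1 - \|(\Delta_\alpha)_{S_{\mathrm{elt}}^c}\|_1\big\} + \lambda_n\omega\big\{\|(\Delta_\beta)_{S_{\mathcal{G}}}\|_{\mathcal{G},2} - \|(\Delta_\beta)_{S_{\mathcal{G}}^c}\|_{\mathcal{G},2}\big\} \\
&\overset{\text{(ii)}}{\le} \frac{\lambda_n}{2}\big\{3\|(\Delta_\alpha)_{S_{\mathrm{elt}}}\|_1 - \|(\Delta_\alpha)_{S_{\mathrm{elt}}^c}\|_1\big\} + \frac{\lambda_n\omega}{2}\big\{3\|(\Delta_\beta)_{S_{\mathcal{G}}}\|_{\mathcal{G},2} - \|(\Delta_\beta)_{S_{\mathcal{G}}^c}\|_{\mathcal{G},2}\big\}.
\end{aligned}
\]
Here step (i) follows by decomposability of the $\ell_1$ and the group norm, and step (ii) follows
by using the inequalities (9.79). Since $\mathcal{E}_n(\Delta) \ge 0$ by convexity, rearranging yields
\[
\begin{aligned}
\|\Delta_\alpha\|_1 + \omega\|\Delta_\beta\|_{\mathcal{G},2} &\le 4\big\{\|(\Delta_\alpha)_{S_{\mathrm{elt}}}\|_1 + \omega\|(\Delta_\beta)_{S_{\mathcal{G}}}\|_{\mathcal{G},2}\big\} \\
&\overset{\text{(iii)}}{\le} 4\big\{\sqrt{|S_{\mathrm{elt}}|}\,\|(\Delta_\alpha)_{S_{\mathrm{elt}}}\|_2 + \omega\sqrt{|S_{\mathcal{G}}|}\,\|(\Delta_\beta)_{S_{\mathcal{G}}}\|_2\big\} \\
&\le 4\big\{\sqrt{|S_{\mathrm{elt}}|} + \omega\sqrt{|S_{\mathcal{G}}|}\big\}\big\{\|(\Delta_\alpha)_{S_{\mathrm{elt}}}\|_2 + \|(\Delta_\beta)_{S_{\mathcal{G}}}\|_2\big\},
\end{aligned} \tag{9.80}
\]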

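where step (iii) follows from the subspace constants for the two decomposable norms. The
overall vector $\Delta$ has the decomposition $\Delta = (\Delta_\alpha)_{S_{\mathrm{elt}}} + (\Delta_\beta)_{S_{\mathcal{G}}} + \Delta_T$, where $T$ is the complement
of the indices in $S_{\mathrm{elt}}$ and $S_{\mathcal{G}}$. Noting that all three sets are disjoint by construction, we have
\[
\|(\Delta_\alpha)_{S_{\mathrm{elt}}}\|_2 + \|(\Delta_\beta)_{S_{\mathcal{G}}}\|_2 = \|(\Delta_\alpha)_{S_{\mathrm{elt}}} + (\Delta_\beta)_{S_{\mathcal{G}}}\|_2 \le \|\Delta\|_2.
\]
Combining with inequality (9.80) completes the proof of the bound (9.78a).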


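Proof of inequality (9.78b): From the proof of Theorem 9.19, recall the lower bound (9.50).
This inequality, combined with the RSC condition, guarantees that the function value $\mathcal{F}(\Delta)$
is at least
\[
\frac{\kappa}{2}\|\Delta\|_2^2 - c_1\frac{\log d}{n}\,\Phi_\omega^2(\Delta) - \big|\langle\nabla\mathcal{L}_n(\theta^*), \Delta\rangle\big| + \lambda_n\big\{\|\alpha^* + \Delta_\alpha\|_1 - \|\alpha^*\|_1\big\} + \lambda_n\omega\big\{\|\beta^* + \Delta_\beta\|_{\mathcal{G},2} - \|\beta^*\|_{\mathcal{G},2}\big\}.
\]
Again, applying the dual norm bounds (9.79) and exploiting decomposability leads to the
lower bound (9.78b).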


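Verifying inequalities (9.79): The only remaining detail is to verify that the conditions (9.79)
defining the event $\mathbb{G}(\lambda_n)$ hold with the stated probability. From the proof of Corollary 9.26, we have
\[
\mathbb{P}\big[\|\nabla\mathcal{L}_n(\theta^*)\|_\infty \ge t\big] \le d\,e^{-\frac{nt^2}{2B^2C^2}}.
\]
Similarly, from the proof of Corollary 9.28, we have
\[
\mathbb{P}\Big[\frac{1}{\omega}\max_{g \in \mathcal{G}} \|(\nabla\mathcal{L}_n(\theta^*))_g\|_2 \ge 2t\Big] \le 2\exp\Big(-\frac{n\omega^2t^2}{2B^2C^2} + m\log 5 + \log|\mathcal{G}|\Big).
\]
Setting $t = 4BC\big\{\sqrt{\frac{\log d}{n}} + \delta\big\}$ and performing some algebra yields the claimed lower bound
$\mathbb{P}[\mathbb{G}(\lambda_n)] \ge 1 - 3e^{-8n\delta^2}$.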


9.8 Techniques for proving restricted strong convexity 


All of the previous results rely on the empirical cost function satisfying some form of restricted
curvature condition. In this section, we turn to a deeper investigation of the conditions
under which restricted strong convexity conditions, as previously formalized in Definition 9.15,
are satisfied.

Before proceeding, let us set up some notation. Given a collection of samples $Z_1^n = \{Z_i\}_{i=1}^n$,
we write the empirical cost as $\mathcal{L}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(\theta; Z_i)$, where $\mathcal{L}$ is the loss applied to a single
sample. We can then define the error in the first-order Taylor expansion of $\mathcal{L}$ for sample $Z_i$,
namely
\[
\mathcal{E}(\Delta; Z_i) := \mathcal{L}(\theta^* + \Delta; Z_i) - \mathcal{L}(\theta^*; Z_i) - \langle\nabla\mathcal{L}(\theta^*; Z_i), \Delta\rangle.
\]
By construction, we have $\mathcal{E}_n(\Delta) = \frac{1}{n}\sum_{i=1}^n \mathcal{E}(\Delta; Z_i)$. Given the population cost function
$\overline{\mathcal{L}}(\theta) := \mathbb{E}[\mathcal{L}_n(\theta; Z_1^n)]$, a local form of strong convexity can be defined in terms of its Taylor-series
error
\[
\overline{\mathcal{E}}(\Delta) := \overline{\mathcal{L}}(\theta^* + \Delta) - \overline{\mathcal{L}}(\theta^*) - \langle\nabla\overline{\mathcal{L}}(\theta^*), \Delta\rangle. \tag{9.81}
\]


We say that the population cost is (locally) $\kappa$-strongly convex around the minimizer $\theta^*$ if
there exists a radius $R > 0$ such that
\[
\overline{\mathcal{E}}(\Delta) \ge \kappa\,\|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{B}_2(R) := \{\Delta \in \Omega \mid \|\Delta\|_2 \le R\}. \tag{9.82}
\]
We wish to see when this type of curvature condition is inherited by the sample-based error
$\mathcal{E}_n(\Delta)$. At a high level, then, our goal is clear: in order to establish a form of restricted
strong convexity (RSC), we need to derive some type of uniform law of large numbers (see
Chapter 4) for the zero-mean stochastic process
\[
\big\{\mathcal{E}_n(\Delta) - \overline{\mathcal{E}}(\Delta),\ \Delta \in \mathbb{S}\big\}, \tag{9.83}
\]
where $\mathbb{S}$ is a suitably chosen subset of $\mathbb{B}_2(R)$.


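Example 9.32 (Least squares) To gain intuition in a specific example, recall the quadratic
cost function $\mathcal{L}(\theta; y_i, x_i) = \frac{1}{2}(y_i - \langle\theta, x_i\rangle)^2$ that underlies least-squares regression. In this case,
we have $\mathcal{E}(\Delta; x_i, y_i) = \frac{1}{2}\langle\Delta, x_i\rangle^2$, and hence
\[
\mathcal{E}_n(\Delta) = \frac{1}{2n}\sum_{i=1}^n \langle\Delta, x_i\rangle^2 = \frac{1}{2n}\|X\Delta\|_2^2,
\]
where $X \in \mathbb{R}^{n \times d}$ is the usual design matrix. Denoting $\Sigma = \mathrm{cov}(x)$, we find that
\[
\overline{\mathcal{E}}(\Delta) = \mathbb{E}[\mathcal{E}_n(\Delta)] = \frac{1}{2}\Delta^{T}\Sigma\Delta.
\]
Thus, our specific goal in this case is to establish a uniform law for the family of random
variables
\[
\Big\{\frac{1}{2}\Delta^{T}\Big(\frac{X^{T}X}{n} - \Sigma\Big)\Delta,\ \Delta \in \mathbb{S}\Big\}. \tag{9.84}
\]
When $\mathbb{S} = \mathbb{B}_2(1)$, the supremum over this family is equal to the operator norm $|||\frac{X^{T}X}{n} - \Sigma|||_2$,
a quantity that we studied in Chapter 6. When $\mathbb{S}$ involves an additional $\ell_1$-constraint, then
a uniform law over this family amounts to establishing a restricted eigenvalue condition, as
studied in Chapter 7. ♣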
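The identities in this example are straightforward to confirm numerically. The sketch below computes the Taylor-series error of the least-squares cost from its definition, compares it with the closed form $\frac{1}{2n}\|X\Delta\|_2^2$, and evaluates the deviation of the quadratic form from its population value for a standard Gaussian design with $\Sigma = I_d$. The data-generating choices and function name are illustrative assumptions.

```python
import numpy as np

def taylor_error_least_squares(X, y, theta_star, delta):
    """First-order Taylor error of the empirical least-squares cost at theta_star."""
    n = X.shape[0]

    def cost(theta):
        return np.sum((y - X @ theta) ** 2) / (2 * n)

    grad = X.T @ (X @ theta_star - y) / n
    return cost(theta_star + delta) - cost(theta_star) - grad @ delta

rng = np.random.default_rng(3)
n, d = 400, 10
X = rng.standard_normal((n, d))            # rows have covariance Sigma = I_d (assumption)
theta_star = rng.standard_normal(d)
y = X @ theta_star + rng.standard_normal(n)
delta = rng.standard_normal(d)

E_n = taylor_error_least_squares(X, y, theta_star, delta)
closed_form = np.sum((X @ delta) ** 2) / (2 * n)

# Deviation of the empirical quadratic form from its population counterpart,
# compared against the operator-norm deviation of the sample covariance.
deviation = abs(closed_form - 0.5 * delta @ delta)
op_dev = np.linalg.norm(X.T @ X / n - np.eye(d), 2)
```

For any fixed direction, `deviation` is at most `0.5 * op_dev * ||delta||_2^2`, which is how control of the operator norm translates into a uniform law over the Euclidean ball.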


9.8.1 Lipschitz cost functions and Rademacher complexity 


This section is devoted to showing how the problem of establishing RSC for Lipschitz cost
functions can be reduced to controlling a version of the Rademacher complexity. As the
reader might expect, the symmetrization and contraction techniques from Chapter 4 turn out
to be useful.
We say that $\mathcal{L}$ is locally $L$-Lipschitz over the ball $\mathbb{B}_2(R)$ if for each sample $Z = (x, y)$
\[
\big|\mathcal{L}(\theta; Z) - \mathcal{L}(\tilde{\theta}; Z)\big| \le L\,\big|\langle\theta, x\rangle - \langle\tilde{\theta}, x\rangle\big| \qquad \text{for all } \theta, \tilde{\theta} \in \mathbb{B}_2(R). \tag{9.85}
\]


Let us illustrate this definition with an example. 


Example 9.33 (Cost functions for binary classification) The class of Lipschitz cost functions includes various objective functions for binary classification, in which the goal is to use the covariates $x \in \mathbb{R}^d$ to predict an underlying class label $y \in \{-1, 1\}$. The simplest approach is based on a linear classification rule: given a weight vector $\theta \in \mathbb{R}^d$, the sign of the inner product $\langle \theta, x \rangle$ is used to make decisions. If we disregard computational issues, the most natural cost function is the 0-1 cost $\mathbb{I}[y \langle \theta, x \rangle < 0]$, which assigns a penalty of 1 if the decision is incorrect, and returns 0 otherwise. (Note that $y \langle \theta, x \rangle < 0$ if and only if $\operatorname{sign}(\langle \theta, x \rangle) \neq y$.) Since the 0-1 cost is non-convex, it is typically replaced by a convex surrogate. For instance, the logistic cost takes the form


$$\mathcal{L}(\theta; (x, y)) := \log(1 + e^{\langle \theta, x \rangle}) - y \langle \theta, x \rangle, \tag{9.86}$$


and it is straightforward to verify that this cost function satisfies the Lipschitz condition with $L = 2$. Similarly, the support vector machine approach to classification is based on the hinge cost


$$\mathcal{L}(\theta; (x, y)) := \max\{0, 1 - y \langle \theta, x \rangle\} \equiv (1 - y \langle \theta, x \rangle)_+, \tag{9.87}$$


which is Lipschitz with parameter $L = 1$. Note that the least-squares cost function $\mathcal{L}(\theta; (x, y)) = \frac{1}{2}(y - \langle \theta, x \rangle)^2$ is not Lipschitz unless additional boundedness conditions are imposed. A similar observation applies to the exponential cost function $\mathcal{L}(\theta; (x, y)) = e^{-y \langle \theta, x \rangle}$. ♣
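As a sanity check on the Lipschitz constants quoted in Example 9.33, the sketch below estimates the largest slope of each cost as a function of the scalar $u = \langle \theta, x \rangle$ by finite differences on a grid. The grid range and resolution are arbitrary choices, and the helper names are illustrative only.

```python
import math

def logistic_cost(u, y):
    """log(1 + e^u) - y * u, with u = <theta, x> and label y in {-1, +1}."""
    softplus = max(u, 0.0) + math.log1p(math.exp(-abs(u)))  # stable log(1 + e^u)
    return softplus - y * u

def hinge_cost(u, y):
    """max(0, 1 - y * u)."""
    return max(0.0, 1.0 - y * u)

def max_slope(cost, y, lo=-20.0, hi=20.0, steps=4000):
    """Largest finite-difference slope of u -> cost(u, y) over a uniform grid."""
    h = (hi - lo) / steps
    best = 0.0
    for k in range(steps):
        u = lo + k * h
        best = max(best, abs(cost(u + h, y) - cost(u, y)) / h)
    return best

slopes = {("logistic", y): max_slope(logistic_cost, y) for y in (-1, 1)}
slopes.update({("hinge", y): max_slope(hinge_cost, y) for y in (-1, 1)})
for key, value in sorted(slopes.items()):
    print(key, round(value, 4))
```

With labels in $\{-1, 1\}$, the derivative of the logistic cost in $u$ is the sigmoid minus $y$, so the slope approaches 2 when $y = -1$ and stays below 1 when $y = +1$; the hinge cost has slope at most 1 for either label.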


In this section, we prove that Lipschitz functions with regression-type data $z = (x, y)$ satisfy a certain form of restricted strong convexity, depending on the tail fluctuations of the covariates. The result itself involves a complexity measure associated with the norm ball of the regularizer $\Phi$. More precisely, letting $\{\varepsilon_i\}_{i=1}^n$ be an i.i.d. sequence of Rademacher variables, we define the symmetrized random vector $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n \varepsilon_i x_i$ and the random variable


$$\Phi^*(\bar{x}_n) := \sup_{\Phi(\theta) \leq 1} \Big\langle \theta, \frac{1}{n} \sum_{i=1}^n \varepsilon_i x_i \Big\rangle. \tag{9.88}$$


When $x_i \sim N(0, I_d)$, the mean $\mathbb{E}[\Phi^*(\bar{x}_n)]$ is proportional to the Gaussian complexity of the unit ball $\{\theta \in \mathbb{R}^d \mid \Phi(\theta) \leq 1\}$. (See Chapter 5 for an in-depth discussion of the Gaussian complexity and its properties.) More generally, the quantity (9.88) reflects the size of the $\Phi$-unit ball with respect to the fluctuations of the covariates.
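To make the quantity (9.88) concrete, the sketch below evaluates $\Phi^*(\bar{x}_n)$ for two regularizers whose dual norms are explicit: the $\ell_1$-norm (dual norm $\ell_\infty$, as in Exercise 9.4(a)) and the group Lasso norm (dual norm equal to the maximum group $\ell_2$-norm). The data, the sign draw and the group structure are hypothetical choices made only for illustration.

```python
import math
import random

def symmetrized_mean(X, signs):
    """x_bar_n = (1/n) * sum_i eps_i * x_i."""
    n, d = len(X), len(X[0])
    return [sum(signs[i] * X[i][j] for i in range(n)) / n for j in range(d)]

def dual_l1(v):
    """Dual norm of the l1-norm: the l-infinity norm."""
    return max(abs(vj) for vj in v)

def dual_group_lasso(v, groups):
    """Dual norm of the l1/l2 group Lasso norm: max over groups of ||v_g||_2."""
    return max(math.sqrt(sum(v[j] ** 2 for j in g)) for g in groups)

rng = random.Random(1)
n, d = 200, 6                                  # illustrative sizes
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
signs = [rng.choice((-1, 1)) for _ in range(n)]
groups = [(0, 1, 2), (3, 4, 5)]                # hypothetical non-overlapping groups

x_bar = symmetrized_mean(X, signs)
val_l1 = dual_l1(x_bar)
val_group = dual_group_lasso(x_bar, groups)
print(val_l1, val_group)
```

Because every coordinate belongs to some group, the $\ell_\infty$ value can never exceed the group value, which in turn is bounded by the full Euclidean norm of $\bar{x}_n$.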


The following theorem applies to any norm $\Phi$ that dominates the Euclidean norm, in the sense that $\Phi(\Delta) \geq \|\Delta\|_2$ uniformly. For a pair of radii $0 < R_\ell < R_u$, it guarantees a form of restricted strong convexity over the "donut" set


$$\mathbb{B}_2(R_\ell, R_u) := \{\Delta \in \mathbb{R}^d \mid R_\ell \leq \|\Delta\|_2 \leq R_u\}. \tag{9.89}$$

The high-probability statement is stated in terms of the random variable $\Phi^*(\bar{x}_n)$, as well as the quantity $M_n(\Phi; R) := 4 \log\big(\frac{R_u}{R_\ell}\big) \log \sup_{\theta \neq 0} \big(\frac{\Phi(\theta)}{\|\theta\|_2}\big)$, which arises for technical reasons.


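Theorem 9.34 Suppose that the cost function $\mathcal{L}$ is $L$-Lipschitz (9.85), and the population cost $\bar{\mathcal{L}}$ is locally $\kappa$-strongly convex (9.82) over the ball $\mathbb{B}_2(R_u)$. Then for any $\delta > 0$, the first-order Taylor error $\mathcal{E}_n$ satisfies


$$|\mathcal{E}_n(\Delta) - \bar{\mathcal{E}}(\Delta)| \leq 16 L \, \Phi(\Delta) \, \delta \quad \text{for all } \Delta \in \mathbb{B}_2(R_\ell, R_u) \tag{9.90}$$

with probability at least $1 - M_n(\Phi; R) \inf_{\lambda > 0} \mathbb{E}\big[e^{\lambda(\Phi^*(\bar{x}_n) - \delta)}\big]$.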


For Lipschitz functions, this theorem reduces the question of establishing RSC to that of controlling the random variable $\Phi^*(\bar{x}_n)$. Let us consider a few examples to illustrate the consequences of Theorem 9.34.


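Example 9.35 (Lipschitz costs and group Lasso) Consider the group Lasso norm $\Phi(\theta) = \sum_{g \in \mathcal{G}} \|\theta_g\|_2$, where we take groups of equal size $m$ for simplicity. Suppose that the covariates $\{x_i\}_{i=1}^n$ are drawn i.i.d. as $N(0, \Sigma)$ vectors, and let $\sigma^2 = |\!|\!|\Sigma|\!|\!|_2$. In this case, we show that for any $L$-Lipschitz cost function, the inequality


$$|\mathcal{E}_n(\Delta) - \bar{\mathcal{E}}(\Delta)| \leq 16 L \sigma \Big\{ \sqrt{\frac{m}{n}} + \sqrt{\frac{2 \log |\mathcal{G}|}{n}} + \delta \Big\} \sum_{g \in \mathcal{G}} \|\Delta_g\|_2$$


holds uniformly for all $\Delta \in \mathbb{B}_2(\frac{1}{d}, 1)$ with probability at least $1 - 4 \log^2(d) \, e^{-n\delta^2/2}$.
In order to establish this claim, we begin by noting that $\Phi^*(\bar{x}_n) = \max_{g \in \mathcal{G}} \|(\bar{x}_n)_g\|_2$ from Table 9.1. Consequently, we have


$$\mathbb{E}\big[e^{\lambda \Phi^*(\bar{x}_n)}\big] \leq \sum_{g \in \mathcal{G}} \mathbb{E}\big[e^{\lambda \|(\bar{x}_n)_g\|_2}\big] = \sum_{g \in \mathcal{G}} \mathbb{E}\big[e^{\lambda(\|(\bar{x}_n)_g\|_2 - \mathbb{E}\|(\bar{x}_n)_g\|_2)}\big] \, e^{\lambda \mathbb{E}[\|(\bar{x}_n)_g\|_2]}.$$


By Theorem 2.26, the random variable $\|(\bar{x}_n)_g\|_2$ has sub-Gaussian concentration around its mean with parameter $\sigma/\sqrt{n}$, whence $\mathbb{E}\big[e^{\lambda(\|(\bar{x}_n)_g\|_2 - \mathbb{E}\|(\bar{x}_n)_g\|_2)}\big] \leq e^{\frac{\lambda^2 \sigma^2}{2n}}$. By Jensen's inequality, we have


$$\mathbb{E}[\|(\bar{x}_n)_g\|_2] \leq \sqrt{\mathbb{E}[\|(\bar{x}_n)_g\|_2^2]} \leq \sigma \sqrt{\frac{m}{n}},$$


using the fact that $\sigma^2 = |\!|\!|\Sigma|\!|\!|_2$. Putting together the pieces, we have shown that


$$\inf_{\lambda > 0} \log \mathbb{E}\Big[e^{\lambda(\Phi^*(\bar{x}_n) - (\epsilon + \sigma\sqrt{m/n}))}\Big] \leq \log |\mathcal{G}| + \inf_{\lambda > 0} \Big\{ \frac{\lambda^2 \sigma^2}{2n} - \lambda \epsilon \Big\} = \log |\mathcal{G}| - \frac{n \epsilon^2}{2 \sigma^2}.$$


With the choices $R_u = 1$ and $R_\ell = \frac{1}{d}$, we have

$$M_n(\Phi; R) = 4 \log(d) \log |\mathcal{G}| \leq 4 \log^2(d),$$

since $|\mathcal{G}| \leq d$. Thus, setting $\epsilon = \sigma \big\{ \sqrt{\frac{2 \log |\mathcal{G}|}{n}} + \delta \big\}$, so that $\log |\mathcal{G}| - \frac{n \epsilon^2}{2\sigma^2} \leq -\frac{n \delta^2}{2}$, and applying Theorem 9.34 with tolerance $\epsilon + \sigma\sqrt{m/n}$ yields the stated claim. ♣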
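The optimization over $\lambda$ in Example 9.35 is a one-dimensional quadratic minimization: the function $\lambda \mapsto \frac{\lambda^2 \sigma^2}{2n} - \lambda \epsilon$ is minimized at $\lambda = n\epsilon/\sigma^2$, with minimum value $-\frac{n\epsilon^2}{2\sigma^2}$. The sketch below compares this closed form against a brute-force grid search; the numerical values of `sigma`, `n` and `eps` are arbitrary illustrative choices.

```python
def exponent(lam, sigma, n, eps):
    """lam^2 * sigma^2 / (2n) - lam * eps, the quantity minimized over lam > 0."""
    return lam ** 2 * sigma ** 2 / (2 * n) - lam * eps

def closed_form_minimum(sigma, n, eps):
    """Minimum value -n * eps^2 / (2 * sigma^2), attained at lam = n * eps / sigma^2."""
    return -n * eps ** 2 / (2 * sigma ** 2)

sigma, n, eps = 1.5, 100, 0.2                  # illustrative values only
lam_star = n * eps / sigma ** 2

# Brute-force search over lam in (0, 50] with step 0.01.
grid_min = min(exponent(k * 0.01, sigma, n, eps) for k in range(1, 5001))
exact_min = closed_form_minimum(sigma, n, eps)
print(lam_star, grid_min, exact_min)
```

The grid minimum sits slightly above the exact minimum, as it must, and the gap shrinks as the grid is refined.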




In Chapter 10, we discuss some consequences of Theorem 9.34 for estimating low-rank matrices. Let us now turn to its proof.


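Proof Recall that

$$\mathcal{E}(\Delta; z_i) := \mathcal{L}(\theta^* + \Delta; z_i) - \mathcal{L}(\theta^*; z_i) - \langle \nabla \mathcal{L}(\theta^*; z_i), \Delta \rangle$$

denotes the Taylor-series error associated with a single sample $z_i = (x_i, y_i)$.

Showing the Taylor error is Lipschitz: We first show that $\mathcal{E}$ is a $2L$-Lipschitz function in $\langle \Delta, x_i \rangle$. To establish this claim, note that if we let $\frac{\partial \mathcal{L}}{\partial u}$ denote the derivative of $\mathcal{L}$ with respect to $u = \langle \theta, x \rangle$, then the Lipschitz condition implies that $\|\frac{\partial \mathcal{L}}{\partial u}\|_\infty \leq L$. Consequently, by the chain rule, for any sample $z_i \in \mathcal{Z}$ and parameters $\Delta, \tilde{\Delta} \in \mathbb{R}^d$, we have


$$|\langle \nabla \mathcal{L}(\theta^*; z_i), \Delta - \tilde{\Delta} \rangle| \leq \Big| \frac{\partial \mathcal{L}}{\partial u}(\theta^*; z_i) \Big| \, |\langle \Delta, x_i \rangle - \langle \tilde{\Delta}, x_i \rangle| \leq L \, |\langle \Delta, x_i \rangle - \langle \tilde{\Delta}, x_i \rangle|. \tag{9.91}$$


Putting together the pieces, for any pair $\Delta, \tilde{\Delta}$, we have

$$|\mathcal{E}(\Delta; z_i) - \mathcal{E}(\tilde{\Delta}; z_i)| \leq |\mathcal{L}(\theta^* + \Delta; z_i) - \mathcal{L}(\theta^* + \tilde{\Delta}; z_i)| + |\langle \nabla \mathcal{L}(\theta^*; z_i), \Delta - \tilde{\Delta} \rangle| \leq 2L \, |\langle \Delta, x_i \rangle - \langle \tilde{\Delta}, x_i \rangle|, \tag{9.92}$$

where the second inequality applies our Lipschitz assumption, and the gradient bound (9.91). Thus, the Taylor error is a $2L$-Lipschitz function in $\langle \Delta, x_i \rangle$.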


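Tail bound for fixed radii: Next we control the difference $|\mathcal{E}_n(\Delta) - \bar{\mathcal{E}}(\Delta)|$ uniformly over certain sets defined by fixed radii. More precisely, for positive quantities $(r_1, r_2)$, define the set


$$\mathbb{C}(r_1, r_2) := \mathbb{B}_2(r_2) \cap \{\Phi(\Delta) \leq r_1 \|\Delta\|_2\},$$


and the random variable $A_n(r_1, r_2) := \frac{1}{4 r_1 r_2 L} \sup_{\Delta \in \mathbb{C}(r_1, r_2)} |\mathcal{E}_n(\Delta) - \bar{\mathcal{E}}(\Delta)|$. The choice of radii can be implicitly understood, so that we adopt the shorthand $A_n$.

Our goal is to control the probability of the event $\{A_n \geq \delta\}$, and we do so by controlling the moment generating function. By our assumptions, the Taylor error has the additive decomposition $\mathcal{E}_n(\Delta) = \frac{1}{n} \sum_{i=1}^n \mathcal{E}(\Delta; Z_i)$. Thus, letting $\{\varepsilon_i\}_{i=1}^n$ denote an i.i.d. Rademacher sequence, applying the symmetrization upper bound from Proposition 4.11(b) yields


$$\mathbb{E}\big[e^{\lambda A_n}\big] \leq \mathbb{E}_{Z, \varepsilon}\Big[ \exp\Big( 2\lambda \sup_{\Delta \in \mathbb{C}(r_1, r_2)} \Big| \frac{1}{4 L r_1 r_2} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \, \mathcal{E}(\Delta; Z_i) \Big| \Big) \Big].$$


Now we have


$$\mathbb{E}\big[e^{\lambda A_n}\big] \overset{(i)}{\leq} \mathbb{E}\Big[ \exp\Big( \frac{\lambda}{r_1 r_2} \sup_{\Delta \in \mathbb{C}(r_1, r_2)} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle \Delta, x_i \rangle \Big| \Big) \Big] \overset{(ii)}{\leq} \mathbb{E}\Big[ \exp\Big\{ \lambda \, \Phi^*\Big( \frac{1}{n} \sum_{i=1}^n \varepsilon_i x_i \Big) \Big\} \Big],$$


where step (i) uses the Lipschitz property (9.92) and the Ledoux-Talagrand contraction inequality (5.61), whereas step (ii) follows from applying Hölder's inequality to the regularizer and its dual (see Exercise 9.7), and uses the fact that $\Phi(\Delta) \leq r_1 r_2$ for any vector $\Delta \in \mathbb{C}(r_1, r_2)$. Adding and subtracting the scalar $\delta > 0$ then yields


$$\log \mathbb{E}\big[e^{\lambda(A_n - \delta)}\big] \leq -\lambda \delta + \log \mathbb{E}\Big[ \exp\Big\{ \lambda \, \Phi^*\Big( \frac{1}{n} \sum_{i=1}^n \varepsilon_i x_i \Big) \Big\} \Big],$$


and consequently, by Markov's inequality,


$$\mathbb{P}\big[A_n(r_1, r_2) \geq \delta\big] \leq \inf_{\lambda > 0} \mathbb{E}\big[ \exp\big( \lambda \{\Phi^*(\bar{x}_n) - \delta\} \big) \big]. \tag{9.93}$$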


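Extension to uniform radii via peeling: This bound (9.93) applies to a fixed choice of quantities $(r_1, r_2)$, whereas the claim of Theorem 9.34 applies to possibly random choices, namely $\frac{\Phi(\Delta)}{\|\Delta\|_2}$ and $\|\Delta\|_2$, respectively, where $\Delta$ might be chosen in a way dependent on the data. In order to extend the bound to all choices, we make use of a peeling argument.
Let $\mathcal{E}$ be the event that the bound (9.90) is violated. For positive integers $(k, \ell)$, define the sets

$$\mathbb{S}_{k,\ell} := \Big\{ \Delta \in \mathbb{R}^d \;\Big|\; 2^{k-1} \leq \frac{\Phi(\Delta)}{\|\Delta\|_2} \leq 2^k \text{ and } 2^{\ell-1} R_\ell \leq \|\Delta\|_2 \leq 2^\ell R_\ell \Big\}.$$

By construction, any vector that can possibly violate the bound (9.90) is contained in the union $\bigcup_{k=1}^{N_1} \bigcup_{\ell=1}^{N_2} \mathbb{S}_{k,\ell}$, where $N_1 := \lceil \log \sup_{\theta \neq 0} \frac{\Phi(\theta)}{\|\theta\|_2} \rceil$ and $N_2 := \lceil \log \frac{R_u}{R_\ell} \rceil$. Suppose that the bound (9.90) is violated by some $\hat{\Delta} \in \mathbb{S}_{k,\ell}$. In this case, we have


$$|\mathcal{E}_n(\hat{\Delta}) - \bar{\mathcal{E}}(\hat{\Delta})| > 16 L \, \frac{\Phi(\hat{\Delta})}{\|\hat{\Delta}\|_2} \, \|\hat{\Delta}\|_2 \, \delta \geq 16 L \, 2^{k-1} 2^{\ell-1} R_\ell \, \delta = 4 L \, 2^k 2^\ell R_\ell \, \delta,$$


which implies that $A_n(2^k, 2^\ell R_\ell) \geq \delta$. Consequently, we have shown that


$$\mathbb{P}[\mathcal{E}] \leq \sum_{k=1}^{N_1} \sum_{\ell=1}^{N_2} \mathbb{P}\big[A_n(2^k, 2^\ell R_\ell) \geq \delta\big] \leq N_1 N_2 \inf_{\lambda > 0} \mathbb{E}\big[e^{\lambda(\Phi^*(\bar{x}_n) - \delta)}\big],$$


where the final step follows by the union bound, and the tail bound (9.93). Given the upper bound $N_1 N_2 \leq 4 \log\big(\sup_{\theta \neq 0} \frac{\Phi(\theta)}{\|\theta\|_2}\big) \log\big(\frac{R_u}{R_\ell}\big) = M_n(\Phi; R)$, the claim follows.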
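The peeling construction assigns each vector in the donut to a dyadic shell indexed by a pair $(k, \ell)$. The sketch below computes these indices for a given vector; it assumes base-2 logarithms, takes $\Phi$ to be the $\ell_1$-norm, and uses a hypothetical inner radius `r_low` and test vector `delta`, none of which are fixed by the text.

```python
import math

def l1_norm(v):
    return sum(abs(x) for x in v)

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def shell_indices(delta, r_low, phi=l1_norm):
    """Smallest positive integers (k, l) whose dyadic shell contains delta.

    Assumes phi(delta) >= ||delta||_2 and ||delta||_2 >= r_low.
    """
    ratio = phi(delta) / l2_norm(delta)
    scale = l2_norm(delta) / r_low
    k = max(1, math.ceil(math.log2(ratio)))
    l = max(1, math.ceil(math.log2(scale)))
    return k, l

def in_shell(delta, k, l, r_low, phi=l1_norm):
    """Check the two defining inequalities of the shell S_{k, l}."""
    ratio = phi(delta) / l2_norm(delta)
    norm = l2_norm(delta)
    return (2 ** (k - 1) <= ratio <= 2 ** k
            and 2 ** (l - 1) * r_low <= norm <= 2 ** l * r_low)

r_low = 0.25                                   # hypothetical inner radius
delta = [0.3, -0.4, 0.0, 0.1]                  # hypothetical vector in the donut
k, l = shell_indices(delta, r_low)
print(k, l, in_shell(delta, k, l, r_low))
```

Since only logarithmically many shells are needed to cover the donut, the union bound costs only the factor $N_1 N_2$ that appears in $M_n(\Phi; R)$.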


9.8.2 A one-sided bound via truncation 


In the previous section, we actually derived two-sided bounds on the difference between the empirical $\mathcal{E}_n$ and population $\bar{\mathcal{E}}$ forms of the Taylor-series error. The resulting upper bounds on $\mathcal{E}_n$ guarantee a form of restricted smoothness, one which is useful in proving fast convergence rates of optimization algorithms. (See the bibliographic section for further details.) However, for proving bounds on the estimation error, as has been our focus in this chapter, it is only restricted strong convexity (that is, the lower bound on the Taylor-series error) that is required.

In this section, we show how a truncation argument can be used to derive restricted strong convexity for generalized linear models. Letting $\{\varepsilon_i\}_{i=1}^n$ denote an i.i.d. sequence of Rademacher variables, we define a complexity measure involving the dual norm $\Phi^*$, namely


$$\mu_n(\Phi^*) := \mathbb{E}_{x, \varepsilon}\Big[ \Phi^*\Big( \frac{1}{n} \sum_{i=1}^n \varepsilon_i x_i \Big) \Big] = \mathbb{E}\Big[ \sup_{\Phi(\Delta) \leq 1} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle \Delta, x_i \rangle \Big].$$


This is simply the Rademacher complexity of the linear function class $x \mapsto \langle \Delta, x \rangle$ as $\Delta$ ranges over the unit ball of the norm $\Phi$.
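For intuition about the size of $\mu_n(\Phi^*)$, the sketch below forms a Monte Carlo estimate in the case where $\Phi$ is the $\ell_1$-norm, so that $\Phi^*$ is the $\ell_\infty$-norm. As a simplifying assumption, the covariates are held fixed and the average is taken over the Rademacher signs only; the sample sizes, dimension and number of trials are illustrative choices.

```python
import random

def dual_l1(v):
    """Dual norm of the l1-norm: the l-infinity norm."""
    return max(abs(vj) for vj in v)

def rademacher_complexity_l1(X, trials, rng):
    """Monte Carlo estimate of E_eps || (1/n) sum_i eps_i x_i ||_inf for fixed X."""
    n, d = len(X), len(X[0])
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        mean = [sum(eps[i] * X[i][j] for i in range(n)) / n for j in range(d)]
        total += dual_l1(mean)
    return total / trials

rng = random.Random(2)
d = 5
X_small = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(25)]
X_large = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(400)]

mu_small = rademacher_complexity_l1(X_small, 200, rng)
mu_large = rademacher_complexity_l1(X_large, 200, rng)
print(mu_small, mu_large)
```

The estimate for the larger sample is noticeably smaller, reflecting the $1/\sqrt{n}$ decay that one expects for this complexity measure.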

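Our theory applies to covariates $\{x_i\}_{i=1}^n$ drawn i.i.d. from a zero-mean distribution such that, for some positive constants $(\alpha, \beta)$, we have


$$\mathbb{E}\big[\langle \Delta, x \rangle^2\big] \geq \alpha \quad \text{and} \quad \mathbb{E}\big[\langle \Delta, x \rangle^4\big] \leq \beta \quad \text{for all vectors } \Delta \in \mathbb{R}^d \text{ with } \|\Delta\|_2 = 1. \tag{9.94}$$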


Theorem 9.36 Consider any generalized linear model with covariates drawn from a zero-mean distribution satisfying the condition (9.94). Then the Taylor-series error $\mathcal{E}_n$ in the log-likelihood is lower bounded as


$$\mathcal{E}_n(\Delta) \geq \frac{\kappa}{2} \|\Delta\|_2^2 - c_0 \, \mu_n^2(\Phi^*) \, \Phi^2(\Delta) \quad \text{for all } \Delta \in \mathbb{R}^d \text{ with } \|\Delta\|_2 \leq 1 \tag{9.95}$$


with probability at least $1 - c_1 e^{-c_2 n}$.


In this statement, the constants $(\kappa, c_0, c_1, c_2)$ can depend on the GLM, the fixed vector $\theta^*$ and $(\alpha, \beta)$, but are independent of dimension, sample size, and regularizer.


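Proof Using a standard formula for the remainder in the Taylor series, we have

$$\mathcal{E}_n(\Delta) = \frac{1}{n} \sum_{i=1}^n \psi''\big( \langle \theta^*, x_i \rangle + t \langle \Delta, x_i \rangle \big) \langle \Delta, x_i \rangle^2,$$

for some scalar $t \in [0, 1]$. We proceed via a truncation argument. Fix some vector $\Delta \in \mathbb{R}^d$ with Euclidean norm $\|\Delta\|_2 = \delta \in (0, 1]$, and set $\tau = K\delta$ for a constant $K > 0$ to be chosen. Since the function $\varphi_\tau(u) = u^2 \, \mathbb{I}[|u| \leq 2\tau]$ lower bounds the quadratic and $\psi''$ is positive, we have


$$\mathcal{E}_n(\Delta) \geq \frac{1}{n} \sum_{i=1}^n \psi''\big( \langle \theta^*, x_i \rangle + t \langle \Delta, x_i \rangle \big) \, \varphi_\tau(\langle \Delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big], \tag{9.96}$$

where $T$ is a second truncation parameter to be chosen. Since $\varphi_\tau$ vanishes outside the interval $[-2\tau, 2\tau]$ and $\tau \leq K$, for any positive term in this sum, the absolute value $|\langle \theta^*, x_i \rangle + t \langle \Delta, x_i \rangle|$ is at most $T + 2K$, and hence


$$\mathcal{E}_n(\Delta) \geq \gamma \, \frac{1}{n} \sum_{i=1}^n \varphi_\tau(\langle \Delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big] \quad \text{where } \gamma := \min_{|u| \leq T + 2K} \psi''(u).$$


Based on this lower bound, it suffices to show that for all $\delta \in (0, 1]$ and for $\Delta \in \mathbb{R}^d$ with $\|\Delta\|_2 = \delta$, we have


$$\frac{1}{n} \sum_{i=1}^n \varphi_{\tau(\delta)}(\langle \Delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big] \geq c_3 \delta^2 - c_4 \, \mu_n(\Phi^*) \, \Phi(\Delta) \, \delta. \tag{9.97}$$

When this bound holds, then inequality (9.95) holds with constants $(\kappa, c_0)$ depending on $(c_3, c_4, \gamma)$. Moreover, we claim that the problem can be reduced to proving the bound (9.97) for $\delta = 1$. Indeed, given any vector with Euclidean norm $\|\Delta\|_2 = \delta > 0$, we can apply the bound (9.97) to the rescaled unit-norm vector $\Delta/\delta$ to obtain


$$\frac{1}{n} \sum_{i=1}^n \varphi_{\tau(1)}(\langle \Delta/\delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big] \geq c_3 - c_4 \, \mu_n(\Phi^*) \, \frac{\Phi(\Delta)}{\delta},$$

where $\tau(1) = K$, and $\tau(\delta) = K\delta$. Noting that $\varphi_{\tau(1)}(u/\delta) = (1/\delta)^2 \varphi_{\tau(\delta)}(u)$, the claim follows by multiplying both sides by $\delta^2$. Thus, the remainder of our proof is devoted to proving (9.97) with $\delta = 1$. In fact, in order to make use of a contraction argument for Lipschitz functions, it is convenient to define a new truncation function


$$\tilde{\varphi}_\tau(u) = u^2 \, \mathbb{I}[|u| \leq \tau] + (u - 2\tau)^2 \, \mathbb{I}[\tau < u \leq 2\tau] + (u + 2\tau)^2 \, \mathbb{I}[-2\tau \leq u < -\tau].$$


Note that it is Lipschitz with parameter $2\tau$. Since $\tilde{\varphi}_\tau$ lower bounds $\varphi_\tau$, it suffices to show that for all unit-norm vectors $\Delta$, we have


$$\frac{1}{n} \sum_{i=1}^n \tilde{\varphi}_{\tau(1)}(\langle \Delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big] \geq c_3 - c_4 \, \mu_n(\Phi^*) \, \Phi(\Delta). \tag{9.98}$$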
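The two truncation functions used in this proof are easy to compare numerically. The sketch below checks on a grid that the Lipschitz version lies below the hard truncation, which in turn lies below the quadratic, and that its finite-difference slope never exceeds $2\tau$. The value of `tau` and the grid are arbitrary illustrative choices.

```python
def phi(u, tau):
    """Hard truncation: u^2 on [-2 tau, 2 tau], zero outside."""
    return u * u if abs(u) <= 2 * tau else 0.0

def phi_tilde(u, tau):
    """Lipschitz truncation: u^2 on [-tau, tau], decaying to zero at +/- 2 tau."""
    if abs(u) <= tau:
        return u * u
    if tau < u <= 2 * tau:
        return (u - 2 * tau) ** 2
    if -2 * tau <= u < -tau:
        return (u + 2 * tau) ** 2
    return 0.0

tau = 1.5                                      # illustrative truncation level
grid = [-5 + 0.001 * k for k in range(10001)]

# Ordering phi_tilde <= phi <= quadratic at every grid point.
ordered = all(phi_tilde(u, tau) <= phi(u, tau) <= u * u for u in grid)

# Largest finite-difference slope of phi_tilde between neighboring grid points.
max_slope = max(abs(phi_tilde(b, tau) - phi_tilde(a, tau)) / (b - a)
                for a, b in zip(grid, grid[1:]))
print(ordered, max_slope)
```

The hard truncation jumps from $4\tau^2$ to zero at $|u| = 2\tau$ and so is not Lipschitz, which is why the contraction argument is applied to the smoothed version instead.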


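For a given radius $r \geq 1$, define the random variable


$$Z_n(r) := \sup_{\substack{\|\Delta\|_2 = 1 \\ \Phi(\Delta) \leq r}} \Big| \frac{1}{n} \sum_{i=1}^n \tilde{\varphi}_\tau(\langle \Delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big] - \mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \, \mathbb{I}[ |\langle \theta^*, x \rangle| \leq T ] \big] \Big|.$$


Suppose that we can prove that


$$\mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \, \mathbb{I}[ |\langle \theta^*, x \rangle| \leq T ] \big] \geq \frac{3}{4} \alpha \tag{9.99a}$$

and

$$\mathbb{P}\big[ Z_n(r) > \alpha/2 + c_4 \, r \, \mu_n(\Phi^*) \big] \leq \exp\big( -c_2 \, n \, r^2 \mu_n^2(\Phi^*) - c_2 \, n \big). \tag{9.99b}$$


The bound (9.98) with $c_3 = \alpha/4$ then follows for all vectors with unit Euclidean norm and $\Phi(\Delta) \leq r$. Accordingly, we prove the bounds (9.99a) and (9.99b) here for a fixed radius $r$. A peeling argument can be used to extend it to all radii, as in the proof of Theorem 9.34, with the probability still upper bounded by $c_1 e^{-c_2 n}$.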


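Proof of the expectation bound (9.99a): We claim that it suffices to show that


$$\mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \big] \overset{(i)}{\geq} \frac{7}{8} \alpha, \quad \text{and} \quad \mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \, \mathbb{I}[ |\langle \theta^*, x \rangle| > T ] \big] \overset{(ii)}{\leq} \frac{1}{8} \alpha.$$

Indeed, if these two inequalities hold, then we have


$$\mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \, \mathbb{I}[ |\langle \theta^*, x \rangle| \leq T ] \big] = \mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \big] - \mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \, \mathbb{I}[ |\langle \theta^*, x \rangle| > T ] \big] \geq \Big\{ \frac{7}{8} - \frac{1}{8} \Big\} \alpha = \frac{3}{4} \alpha.$$

We now prove inequalities (i) and (ii). Beginning with inequality (i), we have

$$\mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \big] \geq \mathbb{E}\big[ \langle \Delta, x \rangle^2 \, \mathbb{I}[ |\langle \Delta, x \rangle| \leq \tau ] \big] = \mathbb{E}\big[ \langle \Delta, x \rangle^2 \big] - \mathbb{E}\big[ \langle \Delta, x \rangle^2 \, \mathbb{I}[ |\langle \Delta, x \rangle| > \tau ] \big] \geq \alpha - \mathbb{E}\big[ \langle \Delta, x \rangle^2 \, \mathbb{I}[ |\langle \Delta, x \rangle| > \tau ] \big],$$


so that it suffices to show that the last term is at most $\alpha/8$. By the condition (9.94) and Markov's inequality, we have


$$\mathbb{P}\big[ |\langle \Delta, x \rangle| > \tau \big] \leq \frac{\mathbb{E}[\langle \Delta, x \rangle^4]}{\tau^4} \leq \frac{\beta}{\tau^4} \quad \text{and} \quad \mathbb{E}\big[ \langle \Delta, x \rangle^4 \big] \leq \beta.$$

Recalling that $\tau = K$ when $\delta = 1$, applying the Cauchy-Schwarz inequality yields


$$\mathbb{E}\big[ \langle \Delta, x \rangle^2 \, \mathbb{I}[ |\langle \Delta, x \rangle| > \tau ] \big] \leq \sqrt{\mathbb{E}[\langle \Delta, x \rangle^4]} \, \sqrt{\mathbb{P}[ |\langle \Delta, x \rangle| > \tau ]} \leq \frac{\beta}{K^2},$$


so that setting $K^2 = 8\beta/\alpha$ guarantees an upper bound of $\alpha/8$, which in turn implies inequality (i) by our earlier reasoning.


Turning to inequality (ii), since


$$\tilde{\varphi}_\tau(\langle \Delta, x \rangle) \leq \langle \Delta, x \rangle^2 \quad \text{and} \quad \mathbb{P}\big[ |\langle \theta^*, x \rangle| \geq T \big] \leq \frac{\beta \|\theta^*\|_2^4}{T^4},$$


the Cauchy-Schwarz inequality implies that


$$\mathbb{E}\big[ \tilde{\varphi}_\tau(\langle \Delta, x \rangle) \, \mathbb{I}[ |\langle \theta^*, x \rangle| > T ] \big] \leq \frac{\beta \|\theta^*\|_2^2}{T^2}.$$


Thus, setting $T^2 = 8\beta \|\theta^*\|_2^2 / \alpha$ guarantees inequality (ii).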


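Proof of the tail bound (9.99b): By our choice $\tau = K$, the empirical process defining $Z_n(r)$ is based on functions bounded in absolute value by $K^2$. Thus, the functional Hoeffding inequality (Theorem 3.26) implies that


$$\mathbb{P}\big[ Z_n(r) \geq \mathbb{E}[Z_n(r)] + r \, \mu_n(\Phi^*) + \alpha/2 \big] \leq e^{-c_2 \, n \, r^2 \mu_n^2(\Phi^*) - c_2 \, n}.$$


As for the expectation, letting $\{\varepsilon_i\}_{i=1}^n$ denote an i.i.d. sequence of Rademacher variables, the usual symmetrization argument (Proposition 4.11) implies that


$$\mathbb{E}[Z_n(r)] \leq 2 \, \mathbb{E}_{x, \varepsilon}\Big[ \sup_{\substack{\|\Delta\|_2 = 1 \\ \Phi(\Delta) \leq r}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \, \tilde{\varphi}_\tau(\langle \Delta, x_i \rangle) \, \mathbb{I}\big[ |\langle \theta^*, x_i \rangle| \leq T \big] \Big| \Big].$$

Since $\mathbb{I}[ |\langle \theta^*, x_i \rangle| \leq T ] \leq 1$ and $\tilde{\varphi}_\tau$ is Lipschitz with parameter $2K$, the contraction principle yields


$$\mathbb{E}[Z_n(r)] \leq 8K \, \mathbb{E}_{x, \varepsilon}\Big[ \sup_{\substack{\|\Delta\|_2 = 1 \\ \Phi(\Delta) \leq r}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle \Delta, x_i \rangle \Big| \Big] \leq 8 K r \, \mathbb{E}\Big[ \Phi^*\Big( \frac{1}{n} \sum_{i=1}^n \varepsilon_i x_i \Big) \Big],$$


where the final step follows by applying Hölder's inequality using $\Phi$ and its dual $\Phi^*$.


9.9 Appendix: Star-shaped property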


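Recall the set $\mathbb{C}$ previously defined in Proposition 9.13. In this appendix, we prove that $\mathbb{C}$ is star-shaped around the origin, meaning that if $\Delta \in \mathbb{C}$, then $t\Delta \in \mathbb{C}$ for all $t \in [0, 1]$. This property is immediate whenever $\theta^* \in \mathbb{M}$, since $\mathbb{C}$ is then a cone, as illustrated in Figure 9.7(a). Now consider the general case, when $\theta^* \notin \mathbb{M}$. We first observe that for any $t \in (0, 1]$,


$$\Pi_{\bar{\mathbb{M}}}(t\Delta) = \arg\min_{\theta \in \bar{\mathbb{M}}} \|t\Delta - \theta\| = t \, \arg\min_{\theta \in \bar{\mathbb{M}}} \Big\| \Delta - \frac{\theta}{t} \Big\| = t \, \Pi_{\bar{\mathbb{M}}}(\Delta),$$


using the fact that $\theta/t$ also belongs to the subspace $\bar{\mathbb{M}}$. A similar argument can be used to establish the equality $\Pi_{\bar{\mathbb{M}}^\perp}(t\Delta) = t \, \Pi_{\bar{\mathbb{M}}^\perp}(\Delta)$. Consequently, for all $\Delta \in \mathbb{C}$, we have


$$\Phi(\Pi_{\bar{\mathbb{M}}^\perp}(t\Delta)) = \Phi(t \, \Pi_{\bar{\mathbb{M}}^\perp}(\Delta)) \overset{(i)}{=} t \, \Phi(\Pi_{\bar{\mathbb{M}}^\perp}(\Delta)) \overset{(ii)}{\leq} t \, \big\{ 3 \, \Phi(\Pi_{\bar{\mathbb{M}}}(\Delta)) + 4 \, \Phi(\theta^*_{\mathbb{M}^\perp}) \big\},$$


where step (i) uses the fact that any norm is positive homogeneous,⁴ and step (ii) uses the inclusion $\Delta \in \mathbb{C}$. We now observe that $3t \, \Phi(\Pi_{\bar{\mathbb{M}}}(\Delta)) = 3 \, \Phi(\Pi_{\bar{\mathbb{M}}}(t\Delta))$, and moreover, since $t \in (0, 1]$, we have $4t \, \Phi(\theta^*_{\mathbb{M}^\perp}) \leq 4 \, \Phi(\theta^*_{\mathbb{M}^\perp})$. Putting together the pieces, we find that


$$\Phi(\Pi_{\bar{\mathbb{M}}^\perp}(t\Delta)) \leq 3 \, \Phi(\Pi_{\bar{\mathbb{M}}}(t\Delta)) + 4t \, \Phi(\theta^*_{\mathbb{M}^\perp}) \leq 3 \, \Phi(\Pi_{\bar{\mathbb{M}}}(t\Delta)) + 4 \, \Phi(\theta^*_{\mathbb{M}^\perp}),$$


showing that $t\Delta \in \mathbb{C}$ for all $t \in (0, 1]$, as claimed.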


9.10 Bibliographic details and background 


The definitions of decomposable regularizers and restricted strong convexity were introduced by Negahban et al. (2012), who first proved a version of Theorem 9.19. Restricted strong convexity is the natural generalization of a restricted eigenvalue to the setting of general (potentially non-quadratic) cost functions, and general decomposable regularizers. A version of Theorem 9.36 was proved in the technical report (Negahban et al., 2010) for the $\ell_1$-norm; note that this result allows for the second derivative $\psi''$ to be unbounded, as in the Poisson case. The class of decomposable regularizers includes the atomic norms studied by Chandrasekaran et al. (2012a), whereas van de Geer (2014) introduced a generalization known as weakly decomposable regularizers.

The argument used in the proof of Theorem 9.19 exploits ideas from Ortega and Rheinboldt (2000) as well as Rothman et al. (2008), who first derived Frobenius norm error bounds on the graphical Lasso (9.12). See Chapter 11 for a more detailed discussion of the graphical Lasso, and related problems concerning graphical models. The choice of regularizer defining the "good" event $\mathbb{G}(\lambda_n)$ in Proposition 9.13 is known as the dual norm bound. It is a cleanly stated and generally applicable choice, sharp for many (but not all) problems. See Exercise 7.15 as well as Chapter 13 for a discussion of instances in which it can be improved. These types of dual-based quantities also arise in analyses of exact recovery based on random projections; see the papers by Mendelson et al. (2007) and Chandrasekaran et al. (2012a) for geometric perspectives of this type.

⁴ Explicitly, for any norm and non-negative scalar $t$, we have $\|tx\| = t\|x\|$.

The $\ell_1/\ell_2$ group Lasso norm from Example 9.3 was introduced by Yuan and Lin (2006); see also Kim et al. (2006). As a convex program, it is a special case of second-order cone program (SOCP), for which there are various efficient algorithms (Bach et al., 2012; Boyd and Vandenberghe, 2004). Turlach et al. (2005) studied the $\ell_1/\ell_\infty$ version of the group Lasso norm. Several groups (Zhao et al., 2009; Baraniuk et al., 2010) have proposed unifying frameworks that include these group-structured norms as particular cases. See Bach et al. (2012) for discussion of algorithmic issues associated with optimization involving group sparse penalties. Jacob et al. (2009) introduced the overlapping group Lasso norm discussed in Example 9.4, and provide detailed discussion of why the standard group Lasso norm with overlap fails to select unions of groups. A number of authors have investigated the statistical benefits of the group Lasso versus the ordinary Lasso when the underlying regression vector is group-sparse; for instance, Obozinski et al. (2011) study the problem of variable selection, whereas the papers (Baraniuk et al., 2010; Huang and Zhang, 2010; Lounici et al., 2011) provide guarantees on the estimation error. Negahban and Wainwright (2011a) study the variable selection properties of $\ell_1/\ell_\infty$-regularization for multivariate regression, and show that, while it can be more statistically efficient than $\ell_1$-regularization with complete shared overlap, this gain is surprisingly non-robust: it is very easy to construct examples in which it is outperformed by the ordinary Lasso. Motivated by this deficiency, Jalali et al. (2010) study a decomposition-based estimator, in which the multivariate regression matrix is decomposed as the sum of an elementwise-sparse and row-sparse matrix (as in Section 9.7), and show that it adapts in the optimal way. The adaptive guarantee given in Corollary 9.31 is of a similar flavor, but as applied to the estimation error as opposed to variable selection.

Convex relaxations based on the nuclear norm introduced in Example 9.8 have been the focus of considerable research; see Chapter 10 for an in-depth discussion.

The $\Phi^*$-norm restricted curvature conditions discussed in Section 9.3 are a generalization of the notion of $\ell_\infty$-restricted eigenvalues (van de Geer and Bühlmann, 2009; Ye and Zhang, 2010; Bühlmann and van de Geer, 2011). See Exercises 7.13, 7.14 and 9.11 for some analysis of these $\ell_\infty$-RE conditions for the usual Lasso, and Exercise 9.14 for some analysis for Lipschitz cost functions. Section 10.2.3 provides various applications of this condition to nuclear norm regularization.


9.11 Exercises


Exercise 9.1 (Overlapping group Lasso) Show that the overlap group Lasso, as defined by the variational representation (9.10), is a valid norm.

Exercise 9.2 (Subspace projection operator) Recall the definition (9.20) of the subspace projection operator. Compute an explicit form for the following subspaces:
(a) For a fixed subset $S \subseteq \{1, 2, \ldots, d\}$, the subspace of vectors
$$\mathbb{M}(S) := \{\theta \in \mathbb{R}^d \mid \theta_j = 0 \text{ for all } j \notin S\}.$$
(b) For a given pair of $r$-dimensional subspaces $\mathbb{U}$ and $\mathbb{V}$, the subspace of matrices
$$\mathbb{M}(\mathbb{U}, \mathbb{V}) := \{\Theta \in \mathbb{R}^{d_1 \times d_2} \mid \operatorname{rowspan}(\Theta) \subseteq \mathbb{U}, \; \operatorname{colspan}(\Theta) \subseteq \mathbb{V}\},$$


where $\operatorname{rowspan}(\Theta)$ and $\operatorname{colspan}(\Theta)$ denote the row and column spans of $\Theta$.

Exercise 9.3 (Generalized linear models) This exercise treats various cases of the generalized linear model.


(a) Suppose that we observe samples of the form $y = \langle x, \theta \rangle + w$, where $w \sim N(0, \sigma^2)$. Show that the conditional distribution of $y$ given $x$ is of the form (9.5) with $c(\sigma) = \sigma^2$ and $\psi(t) = t^2/2$.

(b) Suppose that $y$ is (conditionally) Poisson with mean $\lambda = e^{\langle x, \theta \rangle}$. Show that this is a special case of the log-linear model (9.5) with $c(\sigma) \equiv 1$ and $\psi(t) = e^t$.


Exercise 9.4 (Dual norms) In this exercise, we study various forms of dual norms.


(a) Show that the dual norm of the $\ell_1$-norm is the $\ell_\infty$-norm.
(b) Consider the general group Lasso norm


$$\Phi(u) = \|u\|_{1, \mathcal{G}(p)} = \sum_{g \in \mathcal{G}} \|u_g\|_p,$$


where $p \in [1, \infty]$ is arbitrary, and the groups are non-overlapping. Show that its dual norm takes the form


$$\Phi^*(v) = \|v\|_{\infty, \mathcal{G}(q)} = \max_{g \in \mathcal{G}} \|v_g\|_q,$$


where $q = \frac{p}{p-1}$ is the conjugate exponent to $p$.
(c) Show that the dual norm of the nuclear norm is the $\ell_2$-operator norm


$$\Phi^*(N) = |\!|\!|N|\!|\!|_2 := \sup_{\|z\|_2 = 1} \|Nz\|_2.$$


(Hint: Try to reduce the problem to a version of part (a).)
Exercise 9.5 (Overlapping group norm and duality) Let $p \in [1, \infty]$, and recall the overlapping group norm (9.10).
(a) Show that it has the equivalent representation


$$\Phi(u) = \max_{v \in \mathbb{R}^d} \langle v, u \rangle \quad \text{such that } \|v_g\|_q \leq 1 \text{ for all } g \in \mathcal{G},$$


where $q = \frac{p}{p-1}$ is the dual exponent.
(b) Use part (a) to show that its dual norm is given by
$$\Phi^*(v) = \max_{g \in \mathcal{G}} \|v_g\|_q.$$
Exercise 9.6 (Boundedness of subgradients in the dual norm) Let $\Phi : \mathbb{R}^d \to \mathbb{R}$ be a norm, and $\theta \in \mathbb{R}^d$ be arbitrary. For any $z \in \partial\Phi(\theta)$, show that $\Phi^*(z) \leq 1$.


Exercise 9.7 (Hölder's inequality) Let $\Phi : \mathbb{R}^d \to \mathbb{R}_+$ be a norm, and let $\Phi^* : \mathbb{R}^d \to \mathbb{R}_+$ be its dual norm.


(a) Show that $|\langle u, v \rangle| \leq \Phi(u) \, \Phi^*(v)$ for all $u, v \in \mathbb{R}^d$.
(b) Use part (a) to prove Hölder's inequality for $\ell_p$-norms, namely


$$|\langle u, v \rangle| \leq \|u\|_p \, \|v\|_q,$$


where the exponents $(p, q)$ satisfy the conjugate relation $1/p + 1/q = 1$.
(c) Let $Q \succ 0$ be a positive definite symmetric matrix. Use part (a) to show that


$$|\langle u, v \rangle| \leq \sqrt{u^{\mathrm{T}} Q u} \, \sqrt{v^{\mathrm{T}} Q^{-1} v} \quad \text{for all } u, v \in \mathbb{R}^d.$$


Exercise 9.8 (Complexity parameters) This exercise concerns the complexity parameter $\mu_n(\Phi^*)$ previously defined in equation (9.41). Suppose throughout that the covariates $\{x_i\}_{i=1}^n$ are drawn i.i.d., each sub-Gaussian with parameter $\sigma$.


(a) Consider the group Lasso norm (9.9) with group set $\mathcal{G}$ and maximum group size $m$. Show that

$$\mu_n(\Phi^*) \lesssim \sigma \sqrt{\frac{m}{n}} + \sigma \sqrt{\frac{\log |\mathcal{G}|}{n}}.$$


(b) For the nuclear norm on the space of $d_1 \times d_2$ matrices, show that


$$\mu_n(\Phi^*) \lesssim \sigma \sqrt{\frac{d_1}{n}} + \sigma \sqrt{\frac{d_2}{n}}.$$


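Exercise 9.9 (Equivalent forms of strong convexity) Suppose that a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\kappa$-strongly convex in the sense that


$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\kappa}{2} \|y - x\|_2^2 \quad \text{for all } x, y \in \mathbb{R}^d. \tag{9.100a}$$

Show that

$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \kappa \|y - x\|_2^2 \quad \text{for all } x, y \in \mathbb{R}^d. \tag{9.100b}$$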


Exercise 9.10 (Implications of local strong convexity) Suppose that $f : \mathbb{R}^d \to \mathbb{R}$ is a twice differentiable, convex function that is locally $\kappa$-strongly convex around $x$, in the sense that the lower bound (9.100a) holds for all vectors $z$ in the ball $\mathbb{B}_2(x) := \{z \in \mathbb{R}^d \mid \|z - x\|_2 \leq 1\}$. Show that


$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \kappa \|y - x\|_2 \quad \text{for all } y \in \mathbb{R}^d \setminus \mathbb{B}_2(x).$$
Exercise 9.11 ($\ell_\infty$-curvature and RE conditions) In this exercise, we explore the link between the $\ell_\infty$-curvature condition (9.56) and the $\ell_\infty$-RE condition (9.57). Suppose that the bound (9.56) holds with $\tau_n = c_1 \sqrt{\frac{\log d}{n}}$. Show that the bound (9.57) holds with $\kappa' = \frac{\kappa}{2}$ as long as $n > c_2 |S|^2 \log d$ with $c_2 = \frac{4 c_1^2 (1 + \alpha)^4}{\kappa^2}$.


Exercise 9.12 ($\ell_1$-regularization and soft thresholding) Given observations from the linear model $y = X\theta^* + w$, consider the M-estimator


$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2} \|\theta\|_2^2 - \Big\langle \theta, \frac{1}{n} X^{\mathrm{T}} y \Big\rangle + \lambda_n \|\theta\|_1 \Big\}.$$


(a) Show that the optimal solution is always unique, and given by $\hat{\theta} = T_{\lambda_n}\big(\frac{1}{n} X^{\mathrm{T}} y\big)$, where the soft-thresholding operator $T_{\lambda_n}$ was previously defined (7.6b).
(b) Now suppose that $\theta^*$ is $s$-sparse. Show that if


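$$\lambda_n \geq 2 \Big\{ \Big\| \Big( \frac{X^{\mathrm{T}} X}{n} - I_d \Big) \theta^* \Big\|_\infty + \Big\| \frac{X^{\mathrm{T}} w}{n} \Big\|_\infty \Big\},$$


then the optimal solution satisfies the bound $\|\hat{\theta} - \theta^*\|_2 \leq \frac{3}{2} \sqrt{s} \, \lambda_n$.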

(c) Now suppose that the covariates {x;}7_, are drawn i.i.d. from a zero-mean y-sub-Gaussian 
ensemble with covariance cov(x;) = Ig, and the noise vector w is bounded as ||w]|2 < 
b vn for some b > 0. Show that with an appropriate choice of A,,, we have 


T i logd 
P-e < 3» okb Vs { ES È +o] 


with probability at least 1 — 4e~" for all 6 € (0, 1). 


o0 


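Exercise 9.13 (From $\ell_\infty$ to $\{\ell_1, \ell_2\}$-bounds) In the setting of Corollary 9.27, show that any optimal solution $\hat{\theta}$ that satisfies the $\ell_\infty$-bound (9.65) also satisfies the following $\ell_1$- and $\ell_2$-error bounds

$$\|\hat{\theta} - \theta^*\|_1 \leq \frac{24 \sigma}{\kappa} \, s \sqrt{\frac{\log d}{n}} \quad \text{and} \quad \|\hat{\theta} - \theta^*\|_2 \leq \frac{12 \sigma}{\kappa} \sqrt{\frac{s \log d}{n}}.$$


(Hint: Proposition 9.13 is relevant here.)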


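Exercise 9.14 ($\ell_\infty$-curvature for Lipschitz cost functions) In the setting of regression-type data $z = (x, y) \in \mathcal{X} \times \mathcal{Y}$, consider a cost function whose gradient is elementwise $L$-Lipschitz: i.e., for any sample $z_i$ and pair $\theta, \tilde{\theta}$, the $j$th partial derivative satisfies


$$\Big| \frac{\partial \mathcal{L}(\theta; z_i)}{\partial \theta_j} - \frac{\partial \mathcal{L}(\tilde{\theta}; z_i)}{\partial \theta_j} \Big| \leq L \, \big| x_{ij} \, \langle x_i, \theta - \tilde{\theta} \rangle \big|. \tag{9.101}$$


The goal of this exercise is to show that such a function satisfies an $\ell_\infty$-curvature condition similar to equation (9.64), as required for applying Corollary 9.27.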


(a) Show that for any GLM whose cumulant function has a uniformly bounded second 
derivative (|||. < B), the elementwise Lipschitz condition (9.101) is satisfied with 
L=& 


= 
(b) For a given radius r > 0 and ratio p > 0, define the set 


Alli 
<p, and ||All.. < rj, 
IlAlleo 


T(R; p) := {A eR? | 


and consider the random vector V € R? with elements 


m= ip ttle > fit Dp for f= bh 


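where, for each fixed vector $\Delta$,


$$f_j(\Delta; z_i) := \Big\{ \frac{\partial \mathcal{L}(\theta^* + \Delta; z_i)}{\partial \theta_j} - \frac{\partial \mathcal{L}(\theta^*; z_i)}{\partial \theta_j} \Big\} - \Big\{ \frac{\partial \bar{\mathcal{L}}(\theta^* + \Delta)}{\partial \theta_j} - \frac{\partial \bar{\mathcal{L}}(\theta^*)}{\partial \theta_j} \Big\}$$


is a zero-mean random variable. For each $\lambda > 0$, show that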


z etl] edb bfai 5 call) : 
i=1 


(c) Suppose that the covariates {x;}"_, are sampled independently, with each x;; following a 
zero-mean o--sub-Gaussian distribution. Show that for all t € (0, 0), 


P[IlVlloo = t] < 2d?e° 2", 
(d) Suppose that the population function L satisfies the €.,- curvature condition 
VLG" + A) - VEO > KllAlko forall A € T(r; p). 


Use this condition and the preceding parts to show that 


IVL +A) -VL 2 KllAllo — 16. Lo? 4/ OBd y for all A € T(r; p) 
n 


with probability at least 1 — e7484, 
Matrix estimation with rank constraints


In Chapter 8, we discussed the problem of principal component analysis, which can be 
understood as a particular type of low-rank estimation problem. In this chapter, we turn 
to other classes of matrix problems involving rank and other related constraints. We show 
how the general theory of Chapter 9 can be brought to bear in a direct way so as to obtain 
theoretical guarantees for estimators based on nuclear norm regularization, as well as various 
extensions thereof, including methods for additive matrix decomposition. 


10.1 Matrix regression and applications 


In previous chapters, we have studied various forms of vector-based regression, including standard linear regression (Chapter 7) and extensions based on generalized linear models (Chapter 9). As suggested by its name, matrix regression is the natural generalization of such vector-based problems to the matrix setting. The analog of the Euclidean inner product on the matrix space $\mathbb{R}^{d_1 \times d_2}$ is the trace inner product


$$\langle\!\langle A, B \rangle\!\rangle := \operatorname{trace}(A^{\mathrm{T}} B) = \sum_{j_1 = 1}^{d_1} \sum_{j_2 = 1}^{d_2} A_{j_1 j_2} B_{j_1 j_2}. \tag{10.1}$$


This inner product induces the Frobenius norm $|\!|\!|A|\!|\!|_{\mathrm{F}} = \sqrt{\sum_{j_1 = 1}^{d_1} \sum_{j_2 = 1}^{d_2} (A_{j_1 j_2})^2}$, which is simply the Euclidean norm on a vectorized version of the matrix.

In a matrix regression model, each observation takes the form $Z_i = (X_i, y_i)$, where $X_i \in \mathbb{R}^{d_1 \times d_2}$ is a matrix of covariates, and $y_i \in \mathbb{R}$ is a response variable. As usual, the simplest case is the linear model, in which the response-covariate pair is linked via the equation


$$y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + w_i, \tag{10.2}$$


where $w_i$ is some type of noise variable. We can also write this observation model in a more compact form by defining the observation operator $\mathfrak{X}_n : \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^n$ with elements $[\mathfrak{X}_n(\Theta)]_i = \langle\!\langle X_i, \Theta \rangle\!\rangle$, and then writing


$$y = \mathfrak{X}_n(\Theta^*) + w, \tag{10.3}$$


where $y \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ are the vectors of response and noise variables, respectively. The adjoint of the observation operator, denoted $\mathfrak{X}_n^*$, is the linear mapping from $\mathbb{R}^n$ to $\mathbb{R}^{d_1 \times d_2}$ given by $u \mapsto \sum_{i=1}^n u_i X_i$. Note that the operator $\mathfrak{X}_n$ is the natural generalization of the design matrix $X$, viewed as a mapping from $\mathbb{R}^d$ to $\mathbb{R}^n$ in the usual setting of vector regression.
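The observation operator and its adjoint can be written out directly from these definitions. The sketch below represents matrices as nested lists and verifies the defining adjoint identity $\langle u, \mathfrak{X}_n(\Theta) \rangle = \langle\!\langle \mathfrak{X}_n^*(u), \Theta \rangle\!\rangle$ on random data; the function names and the dimensions are illustrative choices rather than anything prescribed by the text.

```python
import random

def trace_inner(A, B):
    """Trace inner product <<A, B>> = sum over entries of A[j1][j2] * B[j1][j2]."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def observe(Xs, Theta):
    """Observation operator: the i-th entry is <<X_i, Theta>>."""
    return [trace_inner(Xi, Theta) for Xi in Xs]

def adjoint(Xs, u):
    """Adjoint operator: u -> sum_i u_i * X_i."""
    d1, d2 = len(Xs[0]), len(Xs[0][0])
    out = [[0.0] * d2 for _ in range(d1)]
    for ui, Xi in zip(u, Xs):
        for r in range(d1):
            for c in range(d2):
                out[r][c] += ui * Xi[r][c]
    return out

rng = random.Random(3)
n, d1, d2 = 7, 3, 4                            # illustrative sizes

def rand_matrix():
    return [[rng.gauss(0, 1) for _ in range(d2)] for _ in range(d1)]

Xs = [rand_matrix() for _ in range(n)]
Theta = rand_matrix()
u = [rng.gauss(0, 1) for _ in range(n)]

lhs = sum(ui * vi for ui, vi in zip(u, observe(Xs, Theta)))   # <u, X_n(Theta)>
rhs = trace_inner(adjoint(Xs, u), Theta)                      # <<X_n^*(u), Theta>>
frobenius_sq = trace_inner(Theta, Theta)                      # squared Frobenius norm
print(lhs, rhs, frobenius_sq)
```

The first two printed values coincide up to rounding, which is exactly the statement that `adjoint` is the adjoint of `observe` with respect to the Euclidean and trace inner products.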
As illustrated by the examples to follow, there are many applications in which the regression matrix $\Theta^*$ is either low-rank, or well approximated by a low-rank matrix. Thus, if we were to disregard computational costs, an appropriate estimator would be a rank-penalized form of least squares. However, including a rank penalty makes this a non-convex form of least squares so that, apart from certain special cases, it is computationally difficult to solve. This obstacle motivates replacing the rank penalty with the nuclear norm, which leads to the convex program


$$\hat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \Big\{ \frac{1}{2n} \|y - \mathfrak{X}_n(\Theta)\|_2^2 + \lambda_n |\!|\!|\Theta|\!|\!|_{\mathrm{nuc}} \Big\}. \tag{10.4}$$

Recall that the nuclear norm of $\Theta$ is given by the sum of its singular values, namely

$$|\!|\!|\Theta|\!|\!|_{\mathrm{nuc}} = \sum_{j=1}^{d'} \sigma_j(\Theta), \quad \text{where } d' = \min\{d_1, d_2\}. \tag{10.5}$$


See Example 9.8 for our earlier discussion of this matrix norm.


Let us illustrate these definitions with some examples, beginning with the problem of multi- 
variate regression. 


Example 10.1 (Multivariate regression as matrix regression) As previously introduced in
Example 9.6, the multivariate regression observation model can be written as $Y = Z\Theta^* + W$,
where $Z \in \mathbb{R}^{n \times p}$ is the regression matrix, and $Y \in \mathbb{R}^{n \times T}$ is the matrix of responses. The $t$th
column $\Theta^*_{\cdot,t}$ of the $(p \times T)$-dimensional regression matrix $\Theta^*$ can be thought of as an ordinary
regression vector for the $t$th component of the response. In many applications, these vectors
lie on or close to a low-dimensional subspace, which means that the matrix $\Theta^*$ is low-rank,
or well approximated by a low-rank matrix. A direct way of estimating $\Theta^*$ would be via
reduced rank regression, in which one minimizes the usual least-squares cost $|||Y - Z\Theta|||_F^2$
while imposing a rank constraint directly on the regression matrix $\Theta$. Even though this
problem is non-convex due to the rank constraint, it is easily solvable in this special case;
see the bibliographic section and Exercise 10.1 for further details. However, this ease of
solution is very fragile and will no longer hold if other constraints, in addition to a bounded
rank, are added. In such cases, it can be useful to apply nuclear norm regularization in order
to impose a “soft” rank constraint.

Multivariate regression can be recast as a form of the matrix regression model (10.2) with
$N = nT$ observations in total. For each $j = 1,\ldots,n$ and $\ell = 1,\ldots,T$, let $E_{j\ell}$ be an $n \times T$
mask matrix, with zeros everywhere except for a one in position $(j, \ell)$. If we then define
the matrix $X_{j\ell} := Z^{\mathsf{T}} E_{j\ell} \in \mathbb{R}^{p \times T}$, the multivariate regression model is based on the $N = nT$
observations $(X_{j\ell}, y_{j\ell})$, each of the form
\[
y_{j\ell} = \langle\!\langle X_{j\ell}, \Theta^* \rangle\!\rangle + W_{j\ell}, \quad \text{for } j = 1,\ldots,n \text{ and } \ell = 1,\ldots,T.
\]
Consequently, multivariate regression can be analyzed via the general theory that we develop
for matrix regression problems. ♣
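The reformulation in Example 10.1 can be checked numerically. The sketch below (sizes, seed and the helper `trace_inner` are illustrative assumptions) builds each mask matrix, forms $Z^{\mathsf{T}} E_{j\ell}$, and confirms that the trace inner product with $\Theta^*$ plus the noise entry reproduces entry $(j, \ell)$ of $Y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 5, 3, 4
Z = rng.standard_normal((n, p))            # matrix Z in the model Y = Z Theta* + W
theta_star = rng.standard_normal((p, T))   # unknown (p x T) matrix Theta*
W = 0.1 * rng.standard_normal((n, T))      # noise matrix
Y = Z @ theta_star + W

def trace_inner(A: np.ndarray, B: np.ndarray) -> float:
    """Trace inner product <<A, B>> = trace(A^T B)."""
    return float(np.sum(A * B))

max_gap = 0.0
for j in range(n):
    for l in range(T):
        E = np.zeros((n, T))
        E[j, l] = 1.0                      # mask matrix with a single one at (j, l)
        X = Z.T @ E                        # observation matrix, shape (p, T)
        y_jl = trace_inner(X, theta_star) + W[j, l]
        max_gap = max(max_gap, abs(y_jl - Y[j, l]))

print("largest discrepancy over all n*T observations:", max_gap)
```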


Another example of matrix regression is the problem of matrix completion. 


Example 10.2 (Low-rank matrix completion) Matrix completion refers to the problem of
estimating an unknown matrix $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ based on (noisy) observations of a subset of its
entries. Of course, this problem is ill-posed unless further structure is imposed, and so there
are various types of matrix completion problems, depending on this underlying structure.
One possibility is that the unknown matrix has low rank, or more generally can be well
approximated by a low-rank matrix.


As one motivating application, let us consider the “Netflix problem”, in which the rows
of $\Theta^*$ correspond to people, and columns correspond to movies. Matrix entry $\Theta^*_{a,b}$ represents
the rating assigned by person $a$ (say “Alice”) to a given movie $b$ that she has seen. In this
setting, the goal of matrix completion is to make recommendations to Alice, that is, to
suggest other movies that she has not yet seen but would be likely to rate highly. Given
the large corpus of movies stored by Netflix, most entries of the matrix $\Theta^*$ are unobserved,
since any given individual can only watch a limited number of movies over his/her lifetime.
Consequently, this problem is ill-posed without further structure. See Figure 10.1(a) for an 
illustration of this observation model. Empirically, if one computes the singular values of 
recommender matrices, such as those that arise in the Netflix problem, the singular value 
spectrum tends to exhibit a fairly rapid decay—although the matrix itself is not exactly low- 
rank, it can be well-approximated by a matrix of low rank. This phenomenon is illustrated 
for a portion of the Jester joke data set (Goldberg et al., 2001), in Figure 10.1(b). 


[Figure 10.1(b) panel title: Spectral decay for Jester Joke]

Figure 10.1 (a) Illustration of the Netflix problem. Each user (rows of the matrix)
rates a subset of movies (columns of the matrix) on a scale of 1 to 5. All remaining
entries of the matrix are unobserved (marked with +), and the goal of matrix completion
is to fill in these missing entries. (b) Plot of the singular values for a portion of
the Jester joke data set (Goldberg et al., 2001), corresponding to ratings of jokes on
a scale of [-10, 10], and available at http://eigentaste.berkeley.edu/. Although
the matrix is not exactly low-rank, it can be well approximated by a low-rank
matrix.


In this setting, various observation models are possible, with the simplest being that we
are given noiseless observations of a subset of the entries of $\Theta^*$. A slightly more realistic
model allows for noisiness; for instance, in the linear case, we might assume that
\[
\widetilde{y}_i = \Theta^*_{a(i),b(i)} + \frac{w_i}{\sqrt{d_1 d_2}}, \tag{10.6}
\]
where $w_i$ is some form¹ of observation noise, and $(a(i), b(i))$ are the row and column indices
of the $i$th observation.


How can we reformulate the observations as an instance of matrix regression? For sample
index $i$, define the mask matrix $X_i \in \mathbb{R}^{d_1 \times d_2}$, which is zero everywhere except for
position $(a(i), b(i))$, where it takes the value $\sqrt{d_1 d_2}$. Then by defining the rescaled observation
$y_i := \sqrt{d_1 d_2}\, \widetilde{y}_i$, the observation model can be written in the trace regression form as
\[
y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + w_i. \tag{10.7}
\]


We analyze this form of matrix completion in the sequel. 
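The rescaling that turns the entrywise model (10.6) into the trace regression form (10.7) can be verified directly. In this sketch the dimensions, the sampling of observed positions and the Gaussian noise are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, n = 4, 3, 6
theta_star = rng.standard_normal((d1, d2))
scale = np.sqrt(d1 * d2)

rows = rng.integers(0, d1, size=n)        # row indices a(i) of observed entries
cols = rng.integers(0, d2, size=n)        # column indices b(i) of observed entries
w = rng.standard_normal(n)                # observation noise w_i

y_tilde = theta_star[rows, cols] + w / scale   # entrywise observations, as in (10.6)
y = scale * y_tilde                            # rescaled observations

residuals = []
for i in range(n):
    X = np.zeros((d1, d2))
    X[rows[i], cols[i]] = scale                # mask matrix with value sqrt(d1*d2)
    # Trace regression form (10.7): y_i = <<X_i, Theta*>> + w_i
    residuals.append(y[i] - np.sum(X * theta_star) - w[i])

print("largest deviation from (10.7):", max(abs(r) for r in residuals))
```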

Often, matrices might take on discrete values, such as for yes/no votes coded in the set
$\{-1, 1\}$, or ratings belonging to some subset of the positive integers (e.g., $\{1,\ldots,5\}$), in which
case a generalized version of the basic linear model (10.6) would be appropriate. For instance,
in order to model binary-valued responses $y \in \{-1, 1\}$, it could be appropriate to use
the logistic model
\[
\mathbb{P}(y_i \mid X_i, \Theta^*) = \frac{e^{y_i \langle\!\langle X_i, \Theta^* \rangle\!\rangle}}{1 + e^{y_i \langle\!\langle X_i, \Theta^* \rangle\!\rangle}}. \tag{10.8}
\]
In this context, the parameter $\Theta^*_{a,b}$ is proportional to the log-odds ratio for whether user $a$
likes (or dislikes) item $b$. ♣


We now turn to the matrix analog of the compressed sensing observation model, originally 
discussed in Chapter 7 for vectors. It is another special case of the matrix regression problem. 


Example 10.3 (Compressed sensing for low-rank matrices) Working with the linear
observation model (10.3), suppose that the design matrices $X_i \in \mathbb{R}^{d_1 \times d_2}$ are drawn i.i.d. from a
random Gaussian ensemble. In the simplest of settings, the design matrix is chosen from the
standard Gaussian ensemble, meaning that each of its $D = d_1 d_2$ entries is an i.i.d. draw from
the $N(0, 1)$ distribution. In this case, the random operator $\mathfrak{X}_n$ provides $n$ random projections
of the unknown matrix $\Theta^*$, namely
\[
y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle \quad \text{for } i = 1,\ldots,n. \tag{10.9}
\]
In this noiseless setting, it is natural to ask how many such observations suffice to recover
$\Theta^*$ exactly. We address this question in Corollary 10.9 to follow. ♣


The problem of signal phase retrieval leads to a variant of the low-rank compressed sensing 
problem: 


Example 10.4 (Phase retrieval) Let $\theta^* \in \mathbb{R}^d$ be an unknown vector, and suppose that we
make measurements of the form $\widetilde{y}_i = |\langle x_i, \theta^* \rangle|$ where $x_i \sim N(0, I_d)$ is a standard normal
vector. This set-up is a real-valued idealization of the problem of phase retrieval in image
processing, in which we observe the magnitude of complex inner products, and want to
retrieve the phase of the associated complex vector. In this idealized setting, the “phase” can
take only two possible values, namely the possible signs of $\langle x_i, \theta^* \rangle$.

¹ Our choice of normalization by $1/\sqrt{d_1 d_2}$ is for later theoretical convenience, as clarified in the sequel; see
equation (10.36).

A standard semidefinite relaxation is based on lifting the observation model to the space
of matrices. Taking squares on both sides yields the equivalent observation model
\[
\widetilde{y}_i^{\,2} = \big(\langle x_i, \theta^* \rangle\big)^2 = \langle\!\langle x_i \otimes x_i, \; \theta^* \otimes \theta^* \rangle\!\rangle \quad \text{for } i = 1,\ldots,n,
\]
where $\theta^* \otimes \theta^* = \theta^* (\theta^*)^{\mathsf{T}}$ is the rank-one outer product. By defining the scalar observation
$y_i := \widetilde{y}_i^{\,2}$, as well as the matrices $X_i := x_i \otimes x_i$ and $\Theta^* := \theta^* \otimes \theta^*$, we obtain an equivalent
version of the noiseless phase retrieval problem, namely to find a rank-one solution to
the set of matrix-linear equations $y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle$ for $i = 1,\ldots,n$. This problem is non-convex,
but by relaxing the rank constraint to a nuclear norm constraint, we obtain a tractable
semidefinite program (see equation (10.29) to follow).
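The lifting identity behind this relaxation is easy to confirm numerically. The following sketch (dimensions and seed are illustrative) checks that the squared magnitude measurements coincide with trace inner products against the rank-one matrix $\theta^* \otimes \theta^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 8
theta_star = rng.standard_normal(d)        # unknown vector theta*
xs = rng.standard_normal((n, d))           # rows are the Gaussian vectors x_i

y_tilde = np.abs(xs @ theta_star)          # magnitude-only measurements
y = y_tilde ** 2                           # squared observations

Theta = np.outer(theta_star, theta_star)   # lifted rank-one matrix theta* (theta*)^T
# <<x_i (x) x_i, Theta>> computed as an entrywise sum for each i
lifted = np.array([np.sum(np.outer(x, x) * Theta) for x in xs])

print("max |y_i - <<X_i, Theta>>|:", np.max(np.abs(y - lifted)))
print("rank of lifted matrix:", np.linalg.matrix_rank(Theta))
```

Note that the sign of $\theta^*$ is lost in the lifting, since $\theta^*$ and $-\theta^*$ give the same outer product.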

Overall, the phase retrieval problem is a variant of the compressed sensing problem from
Example 10.3, in which the random design matrices $X_i$ are no longer Gaussian, but rather
the outer product $x_i \otimes x_i$ of two Gaussian vectors. In Corollary 10.13 to follow, we show that
the solution of the semidefinite relaxation coincides with the rank-constrained problem with
high probability given $n \succsim d$ observations. ♣


Matrix estimation problems also arise in modeling of time series, where the goal is to de- 
scribe the dynamics of an underlying process. 


Example 10.5 (Time-series and vector autoregressive processes) A vector autoregressive
(VAR) process in $d$ dimensions consists of a sequence of $d$-dimensional random vectors
$\{z^t\}_{t=1}^N$ that are generated by first choosing the random vector $z^1 \in \mathbb{R}^d$ according to some
initial distribution, and then recursively setting
\[
z^{t+1} = \Theta^* z^t + w^t, \quad \text{for } t = 1, 2, \ldots, N-1. \tag{10.10}
\]
Here the sequence of $d$-dimensional random vectors $\{w^t\}_{t=1}^{N-1}$ forms the driving noise of the
process; we model them as i.i.d. zero-mean random vectors with covariance $\Gamma \succ 0$. Of
interest to us is the matrix $\Theta^* \in \mathbb{R}^{d \times d}$ that controls the dependence between successive
samples of the process. Assuming that $w^t$ is independent of $z^t$ for each $t$, the covariance
matrix $\Sigma^t = \mathrm{cov}(z^t)$ of the process evolves according to the recursion $\Sigma^{t+1} := \Theta^* \Sigma^t (\Theta^*)^{\mathsf{T}} + \Gamma$.
Whenever $|||\Theta^*|||_2 < 1$, it can be shown that the process is stable, meaning that the eigenvalues
of $\Sigma^t$ stay bounded independently of $t$, and the sequence $\{\Sigma^t\}_{t=1}^{\infty}$ converges to a well-defined
limiting object. (See Exercise 10.2.)
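The stability claim can be illustrated by iterating the covariance recursion. This is only a numerical sketch under assumed choices (identity noise covariance, identity initial covariance, and a system matrix rescaled to operator norm 0.8); it does not replace the argument of Exercise 10.2.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = rng.standard_normal((d, d))
op_norm = 0.8
theta_star = op_norm * A / np.linalg.norm(A, 2)   # rescaled so the operator norm is 0.8
gamma = np.eye(d)                                  # noise covariance (assumed identity)
sigma = np.eye(d)                                  # covariance of z^1 (assumed identity)

max_eigs = []
for t in range(200):
    # Covariance recursion: Sigma^{t+1} = Theta* Sigma^t (Theta*)^T + Gamma
    sigma = theta_star @ sigma @ theta_star.T + gamma
    max_eigs.append(np.linalg.eigvalsh(sigma).max())

# At a limit point, Sigma must satisfy Sigma = Theta* Sigma (Theta*)^T + Gamma.
fixed_point_gap = np.linalg.norm(sigma - (theta_star @ sigma @ theta_star.T + gamma))
print("largest eigenvalue seen:", max(max_eigs))
print("geometric-series bound 1/(1 - 0.8^2):", 1.0 / (1.0 - op_norm ** 2))
print("fixed-point gap after 200 steps:", fixed_point_gap)
```

With these choices the largest eigenvalue is bounded by the geometric series $\sum_k 0.8^{2k}$, which is the mechanism behind the boundedness claim.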

Our goal is to estimate the system parameters, namely the $d$-dimensional matrices $\Theta^*$
and $\Gamma$. When the noise covariance $\Gamma$ is known and strictly positive definite, one possible
estimator for $\Theta^*$ is based on a sum of quadratic losses over successive samples, namely
\[
\mathcal{L}_n(\Theta) = \frac{1}{2N} \sum_{t=1}^{N-1} \| z^{t+1} - \Theta z^t \|_{\Gamma^{-1}}^2, \tag{10.11}
\]
where $\|a\|_{\Gamma^{-1}} := \sqrt{\langle a, \Gamma^{-1} a \rangle}$ is the quadratic norm defined by $\Gamma$. When the driving noise $w^t$ is
zero-mean Gaussian with covariance $\Gamma$, then this cost function is equivalent to the negative
log-likelihood, disregarding terms not depending on $\Theta$.

In many applications, among them subspace tracking and biomedical signal processing,
the system matrix $\Theta^*$ can be modeled as being low-rank, or well approximated by a low-rank
matrix. In this case, the nuclear norm is again an appropriate choice of regularizer, and when
combined with the loss function (10.11), we obtain another form of semidefinite program to
solve.

Although different on the surface, this VAR observation model can be reformulated as a
particular instance of the matrix regression model (10.2), in particular one with $n = d(N-1)$
observations in total. At each time $t = 2,\ldots,N$, we receive a total of $d$ observations. Letting
$e_j \in \mathbb{R}^d$ denote the canonical basis vector with a single one in position $j$, the $j$th observation
in the block has the form
\[
z_j^t = \langle e_j, z^t \rangle = \langle e_j, \Theta^* z^{t-1} \rangle + w_j^{t-1} = \langle\!\langle e_j \otimes z^{t-1}, \Theta^* \rangle\!\rangle + w_j^{t-1},
\]
so that in the matrix regression observation model (10.2), we have $y_i = (z^t)_j$ and $X_i = e_j \otimes z^{t-1}$
when $i$ indexes the sample $(t, j)$. ♣


10.2 Analysis of nuclear norm regularization 


Having motivated problems of low-rank matrix regression, we now turn to the development 
and analysis of M-estimators based on nuclear norm regularization. Our goal is to bring to 
bear the general theory from Chapter 9. This general theory requires specification of certain 
subspaces over which the regularizer decomposes, as well as restricted strong convexity 
conditions related to these subspaces. This section is devoted to the development of these 
two ingredients in the special case of the nuclear norm (10.5).


10.2.1 Decomposability and subspaces 


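We begin by developing appropriate choices of decomposable subspaces for the nuclear
norm. For any given matrix $\Theta \in \mathbb{R}^{d_1 \times d_2}$, we let $\mathrm{rowspan}(\Theta) \subseteq \mathbb{R}^{d_2}$ and $\mathrm{colspan}(\Theta) \subseteq \mathbb{R}^{d_1}$
denote its row space and column space, respectively. For a given positive integer $r \le d' :=
\min\{d_1, d_2\}$, let $\mathbb{U}$ and $\mathbb{V}$ denote $r$-dimensional subspaces of vectors. We can then define the
two subspaces of matrices
\[
\mathbb{M}(\mathbb{U}, \mathbb{V}) := \{ \Theta \in \mathbb{R}^{d_1 \times d_2} \mid \mathrm{rowspan}(\Theta) \subseteq \mathbb{V}, \; \mathrm{colspan}(\Theta) \subseteq \mathbb{U} \} \tag{10.12a}
\]
and
\[
\overline{\mathbb{M}}^{\perp}(\mathbb{U}, \mathbb{V}) := \{ \Theta \in \mathbb{R}^{d_1 \times d_2} \mid \mathrm{rowspan}(\Theta) \subseteq \mathbb{V}^{\perp}, \; \mathrm{colspan}(\Theta) \subseteq \mathbb{U}^{\perp} \}. \tag{10.12b}
\]
Here $\mathbb{U}^{\perp}$ and $\mathbb{V}^{\perp}$ denote the subspaces orthogonal to $\mathbb{U}$ and $\mathbb{V}$, respectively. When the
subspaces $(\mathbb{U}, \mathbb{V})$ are clear from the context, we omit them so as to simplify notation. From the
definition (10.12a), any matrix in the model space $\mathbb{M}$ has rank at most $r$. On the other hand,
equation (10.12b) defines the subspace $\overline{\mathbb{M}}(\mathbb{U}, \mathbb{V})$ implicitly, via taking the orthogonal
complement. We show momentarily that unlike other regularizers considered in Chapter 9, this
definition implies that $\overline{\mathbb{M}}(\mathbb{U}, \mathbb{V})$ is a strict superset of $\mathbb{M}(\mathbb{U}, \mathbb{V})$.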

To provide some intuition for the definition (10.12), it is helpful to consider an explicit
matrix-based representation of the subspaces. Recalling that $d' = \min\{d_1, d_2\}$, let $U \in \mathbb{R}^{d_1 \times d'}$
and $V \in \mathbb{R}^{d_2 \times d'}$ be a pair of orthonormal matrices. These matrices can be used to define
$r$-dimensional spaces: namely, let $\mathbb{U}$ be the span of the first $r$ columns of $U$, and similarly,
let $\mathbb{V}$ be the span of the first $r$ columns of $V$. In practice, these subspaces correspond
(respectively) to the spaces spanned by the top $r$ left and right singular vectors of the target
matrix $\Theta^*$.

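With these choices, any pair of matrices $A \in \mathbb{M}(\mathbb{U}, \mathbb{V})$ and $B \in \overline{\mathbb{M}}^{\perp}(\mathbb{U}, \mathbb{V})$ can be
represented in the form
\[
A = U \begin{bmatrix} \Gamma_{11} & 0_{r \times (d'-r)} \\ 0_{(d'-r) \times r} & 0_{(d'-r) \times (d'-r)} \end{bmatrix} V^{\mathsf{T}}
\quad \text{and} \quad
B = U \begin{bmatrix} 0_{r \times r} & 0_{r \times (d'-r)} \\ 0_{(d'-r) \times r} & \Gamma_{22} \end{bmatrix} V^{\mathsf{T}}, \tag{10.13}
\]
where $\Gamma_{11} \in \mathbb{R}^{r \times r}$ and $\Gamma_{22} \in \mathbb{R}^{(d'-r) \times (d'-r)}$ are arbitrary matrices. Thus, we see that $\mathbb{M}$
corresponds to the subspace of matrices with non-zero left and right singular vectors contained
within the span of the first $r$ columns of $U$ and $V$, respectively.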

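On the other hand, the set $\overline{\mathbb{M}}^{\perp}$ corresponds to the subspace of matrices with non-zero left
and right singular vectors associated with the remaining $d' - r$ columns of $U$ and $V$. Since
the trace inner product defines orthogonality, any member $A$ of $\overline{\mathbb{M}}(\mathbb{U}, \mathbb{V})$ must take the form
\[
A = U \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & 0 \end{bmatrix} V^{\mathsf{T}}, \tag{10.14}
\]
where all three matrices $\Gamma_{11} \in \mathbb{R}^{r \times r}$, $\Gamma_{12} \in \mathbb{R}^{r \times (d'-r)}$ and $\Gamma_{21} \in \mathbb{R}^{(d'-r) \times r}$ are arbitrary. In this
way, we see explicitly that $\overline{\mathbb{M}}$ is a strict superset of $\mathbb{M}$ whenever $r < d'$. An important fact,
however, is that $\overline{\mathbb{M}}$ is not substantially larger than $\mathbb{M}$. Whereas any matrix in $\mathbb{M}$ has rank at
most $r$, the representation (10.14) shows that any matrix in $\overline{\mathbb{M}}$ has rank at most $2r$.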

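The preceding discussion also demonstrates the decomposability of the nuclear norm.
Using the representation (10.13), for an arbitrary pair of matrices $A \in \mathbb{M}$ and $B \in \overline{\mathbb{M}}^{\perp}$, we
have
\[
|||A + B|||_{\mathrm{nuc}}
\overset{(i)}{=} \left|\left|\left| \begin{bmatrix} \Gamma_{11} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & \Gamma_{22} \end{bmatrix} \right|\right|\right|_{\mathrm{nuc}}
= \left|\left|\left| \begin{bmatrix} \Gamma_{11} & 0 \\ 0 & 0 \end{bmatrix} \right|\right|\right|_{\mathrm{nuc}} + \left|\left|\left| \begin{bmatrix} 0 & 0 \\ 0 & \Gamma_{22} \end{bmatrix} \right|\right|\right|_{\mathrm{nuc}}
\overset{(ii)}{=} |||A|||_{\mathrm{nuc}} + |||B|||_{\mathrm{nuc}},
\]
where steps (i) and (ii) use the invariance of the nuclear norm to orthogonal transformations
corresponding to multiplication by the matrices $U$ or $V$, respectively.

When the target matrix $\Theta^*$ is of rank $r$, then the “best” choice of the model subspace
(10.12a) is clear. In particular, the low-rank condition on $\Theta^*$ means that it can be factored
in the form $\Theta^* = U D V^{\mathsf{T}}$, where the diagonal matrix $D \in \mathbb{R}^{d' \times d'}$ has the $r$ non-zero singular
values of $\Theta^*$ in its first $r$ diagonal entries. The matrices $U \in \mathbb{R}^{d_1 \times d'}$ and $V \in \mathbb{R}^{d_2 \times d'}$ are
orthonormal, with their first $r$ columns corresponding to the left and right singular vectors,
respectively, of $\Theta^*$. More generally, even when $\Theta^*$ is not exactly of rank $r$, matrix subspaces
of this form are useful: we simply choose the first $r$ columns of $U$ and $V$ to index the singular
vectors associated with the largest singular values of $\Theta^*$, a subspace that we denote by
$\mathbb{M}(\mathbb{U}^r, \mathbb{V}^r)$.

With these details in place, let us state for future reference a consequence of
Proposition 9.13 for M-estimators involving the nuclear norm. Consider an M-estimator of the form
\[
\widehat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \big\{ \mathcal{L}_n(\Theta) + \lambda_n |||\Theta|||_{\mathrm{nuc}} \big\},
\]
where $\mathcal{L}_n$ is some convex and differentiable cost function. Then for any choice of
regularization parameter $\lambda_n \ge 2 |||\nabla \mathcal{L}_n(\Theta^*)|||_2$, the error matrix $\widehat{\Delta} = \widehat{\Theta} - \Theta^*$ must satisfy the cone-like
constraint
\[
|||\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}|||_{\mathrm{nuc}} \le 3 |||\widehat{\Delta}_{\overline{\mathbb{M}}}|||_{\mathrm{nuc}} + 4 |||\Theta^*_{\mathbb{M}^{\perp}}|||_{\mathrm{nuc}}, \tag{10.15}
\]
where $\mathbb{M} = \mathbb{M}(\mathbb{U}^r, \mathbb{V}^r)$ and $\overline{\mathbb{M}} = \overline{\mathbb{M}}(\mathbb{U}^r, \mathbb{V}^r)$. Here the reader should recall that $\widehat{\Delta}_{\overline{\mathbb{M}}}$ denotes the
projection of the matrix $\widehat{\Delta}$ onto the subspace $\overline{\mathbb{M}}$, with the other terms defined similarly.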
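The two structural facts of this subsection, decomposability over the pair of block-structured subspaces and the rank bound $2r$ for matrices of the form (10.14), can be checked on a random instance. The construction below (random orthonormal bases via QR, Gaussian blocks, helper `nuc`) is an illustrative assumption, not a procedure from the text.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2, r = 6, 5, 2
dp = min(d1, d2)                                     # d' = min(d1, d2)

# Matrices with orthonormal columns, obtained from reduced QR factorizations.
U, _ = np.linalg.qr(rng.standard_normal((d1, dp)))
V, _ = np.linalg.qr(rng.standard_normal((d2, dp)))

def nuc(M: np.ndarray) -> float:
    """Nuclear norm: sum of singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

G = rng.standard_normal((dp, dp))
core_A = np.zeros((dp, dp)); core_A[:r, :r] = G[:r, :r]   # only Gamma_11 non-zero
core_B = np.zeros((dp, dp)); core_B[r:, r:] = G[r:, r:]   # only Gamma_22 non-zero
core_bar = G.copy(); core_bar[r:, r:] = 0.0               # form (10.14): Gamma_22 block is zero

A = U @ core_A @ V.T
B = U @ core_B @ V.T
A_bar = U @ core_bar @ V.T

decomp_gap = abs(nuc(A + B) - nuc(A) - nuc(B))
print("|nuc(A+B) - nuc(A) - nuc(B)|:", decomp_gap)
print("rank of A:", np.linalg.matrix_rank(A), " (at most r =", r, ")")
print("rank of matrix of form (10.14):", np.linalg.matrix_rank(A_bar), " (at most 2r =", 2 * r, ")")
```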


10.2.2 Restricted strong convexity and error bounds 


We begin our exploration of nuclear norm regularization in the simplest setting, namely
when it is coupled with a least-squares objective function. More specifically, given
observations $(y, \mathfrak{X}_n)$ from the matrix regression model (10.3), consider the estimator
\[
\widehat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \Big\{ \frac{1}{2n} \|y - \mathfrak{X}_n(\Theta)\|_2^2 + \lambda_n |||\Theta|||_{\mathrm{nuc}} \Big\}, \tag{10.16}
\]
where $\lambda_n > 0$ is a user-defined regularization parameter. As discussed in the previous section,
the nuclear norm is a decomposable regularizer and the least-squares cost is convex, and so
given a suitable choice of $\lambda_n$, the error matrix $\widehat{\Delta} := \widehat{\Theta} - \Theta^*$ must satisfy the cone-like
constraint (10.15).
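The text treats (10.16) as a convex program without fixing an algorithm. One simple way to compute a solution numerically is proximal gradient descent, in which the proximal step for the nuclear norm soft-thresholds the singular values. The sketch below is an illustration under stated assumptions: the solver, the function names, the step size rule and the choice of $\lambda_n$ are not taken from the text.

```python
import numpy as np

def svd_soft_threshold(M: np.ndarray, tau: float) -> np.ndarray:
    """Proximal operator of tau * nuclear norm: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_least_squares(X: np.ndarray, y: np.ndarray, lam: float,
                               n_iter: int = 500) -> np.ndarray:
    """Proximal gradient for (1/2n)||y - X(Theta)||_2^2 + lam * ||Theta||_nuc.

    X has shape (n, d1, d2); row i holds the observation matrix X_i.
    """
    n, d1, d2 = X.shape
    A = X.reshape(n, d1 * d2)                    # each X_i flattened into a row
    step = n / np.linalg.norm(A, 2) ** 2         # 1 / Lipschitz constant of the gradient
    theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        residual = A @ theta.ravel() - y
        grad = (A.T @ residual).reshape(d1, d2) / n
        theta = svd_soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(6)
d1, d2, r, n = 8, 8, 2, 200
theta_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2)) / np.sqrt(r)
X = rng.standard_normal((n, d1, d2))             # standard Gaussian observation matrices
sigma = 0.1
y = np.einsum("ijk,jk->i", X, theta_star) + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt((d1 + d2) / n)         # heuristic scale, chosen for illustration
theta_hat = nuclear_norm_least_squares(X, y, lam)
rel_err = np.linalg.norm(theta_hat - theta_star) / np.linalg.norm(theta_star)
print("relative Frobenius error:", rel_err)
```

For larger problems one would typically use an accelerated variant or a dedicated convex solver; the plain iteration is kept here for readability.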

The second ingredient of the general theory from Chapter 9 is restricted strong convexity
of the loss function. For this least-squares cost, restricted strong convexity amounts to lower
bounding the quadratic form $\frac{\|\mathfrak{X}_n(\Delta)\|_2^2}{2n}$. In the sequel, we show the random operator $\mathfrak{X}_n$ satisfies
a uniform lower bound of the form
\[
\frac{\|\mathfrak{X}_n(\Delta)\|_2^2}{2n} \ge \frac{\kappa}{2} |||\Delta|||_F^2 - c_0 \frac{(d_1 + d_2)}{n} |||\Delta|||_{\mathrm{nuc}}^2, \quad \text{for all } \Delta \in \mathbb{R}^{d_1 \times d_2}, \tag{10.17}
\]
with high probability. Here the quantity $\kappa > 0$ is a curvature constant, and $c_0$ is another
universal constant of secondary importance. In the notation of Chapter 9, this lower bound
implies a form of restricted strong convexity (in particular, see Definition 9.15) with
curvature $\kappa$ and tolerance $\tau_n^2 = c_0 \frac{(d_1 + d_2)}{n}$. We then have the following corollary of Theorem 9.19:


Proposition 10.6 Suppose that the observation operator $\mathfrak{X}_n$ satisfies the restricted
strong convexity condition (10.17) with parameter $\kappa > 0$. Then conditioned on the
event $\mathcal{G}(\lambda_n) = \big\{ ||| \frac{1}{n} \sum_{i=1}^n w_i X_i |||_2 \le \frac{\lambda_n}{2} \big\}$, any optimal solution to nuclear norm regularized
least squares (10.16) satisfies the bound
\[
|||\widehat{\Theta} - \Theta^*|||_F^2 \le \frac{9}{2} \frac{\lambda_n^2}{\kappa^2}\, r + \frac{1}{\kappa} \Big\{ 2 \lambda_n \sum_{j=r+1}^{d'} \sigma_j(\Theta^*) + \frac{32\, c_0 (d_1 + d_2)}{n} \Big[ \sum_{j=r+1}^{d'} \sigma_j(\Theta^*) \Big]^2 \Big\}, \tag{10.18}
\]
valid for any $r \in \{1,\ldots,d'\}$ such that $r \le \frac{\kappa\, n}{128\, c_0 (d_1 + d_2)}$.

Remark: As with Theorem 9.19, the result of Proposition 10.6 is a type of oracle inequality:
it applies to any matrix $\Theta^*$, and involves a natural splitting into estimation and approximation
error, parameterized by the choice of $r$. Note that the choice of $r$ can be optimized so as to
obtain the tightest possible bound.

The bound (10.18) takes a simpler form in special cases. For instance, suppose that
$\mathrm{rank}(\Theta^*) < d'$ and moreover that $n > 128 \frac{c_0}{\kappa} \mathrm{rank}(\Theta^*) (d_1 + d_2)$. We then may apply the
bound (10.18) with $r = \mathrm{rank}(\Theta^*)$. Since $\sum_{j=r+1}^{d'} \sigma_j(\Theta^*) = 0$, Proposition 10.6 implies the
upper bound
\[
|||\widehat{\Theta} - \Theta^*|||_F^2 \le \frac{9}{2} \frac{\lambda_n^2}{\kappa^2}\, \mathrm{rank}(\Theta^*). \tag{10.19}
\]
We make frequent use of this simpler bound in the sequel.


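Proof For each $r \in \{1,\ldots,d'\}$, let $(\mathbb{U}^r, \mathbb{V}^r)$ be the subspaces spanned by the top $r$ left
and right singular vectors of $\Theta^*$, and recall the subspaces $\mathbb{M}(\mathbb{U}^r, \mathbb{V}^r)$ and $\overline{\mathbb{M}}^{\perp}(\mathbb{U}^r, \mathbb{V}^r)$
previously defined in (10.12). As shown previously, the nuclear norm is decomposable with
respect to any such subspace pair. In general, the “good” event from Chapter 9 is given
by $\mathcal{G}(\lambda_n) = \big\{ \Phi^*(\nabla \mathcal{L}_n(\Theta^*)) \le \frac{\lambda_n}{2} \big\}$. From Table 9.1, the dual norm to the nuclear norm is the
$\ell_2$-operator norm. For the least-squares cost function, we have $\nabla \mathcal{L}_n(\Theta^*) = -\frac{1}{n} \sum_{i=1}^n w_i X_i$, so
that the statement of Proposition 10.6 involves the specialization of this event to the nuclear
norm and least-squares cost.

The assumption (10.17) is a form of restricted strong convexity with tolerance parameter
$\tau_n^2 = c_0 \frac{d_1 + d_2}{n}$. It only remains to verify the condition $\tau_n^2 \Psi^2(\overline{\mathbb{M}}) \le \frac{\kappa}{64}$. The
representation (10.14) reveals that any matrix $\Theta \in \overline{\mathbb{M}}(\mathbb{U}^r, \mathbb{V}^r)$ has rank at most $2r$, and hence
\[
\Psi(\overline{\mathbb{M}}(\mathbb{U}^r, \mathbb{V}^r)) := \sup_{\Theta \in \overline{\mathbb{M}}(\mathbb{U}^r, \mathbb{V}^r) \setminus \{0\}} \frac{|||\Theta|||_{\mathrm{nuc}}}{|||\Theta|||_F} \le \sqrt{2r}.
\]
Consequently, the final condition of Theorem 9.19 holds whenever the target rank $r$ is
bounded as in the statement of Proposition 10.6, which completes the proof.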
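For completeness, the two short computations used in this proof can be written out. The first is the Cauchy-Schwarz step behind the bound on the subspace constant for a matrix of rank at most $2r$; the second shows how the rank condition in Proposition 10.6 follows from the tolerance condition. This is a worked version of steps the proof leaves implicit, with the tolerance taken as $\tau_n^2 = c_0 (d_1+d_2)/n$.

```latex
\[
|||\Theta|||_{\mathrm{nuc}} = \sum_{j=1}^{2r} \sigma_j(\Theta)
\;\le\; \sqrt{2r}\,\Big(\sum_{j=1}^{2r} \sigma_j^2(\Theta)\Big)^{1/2}
= \sqrt{2r}\; |||\Theta|||_F ,
\]
\[
\tau_n^2\, \Psi^2(\overline{\mathbb{M}}) \;\le\; c_0\,\frac{d_1+d_2}{n}\cdot 2r \;\le\; \frac{\kappa}{64}
\quad\Longleftrightarrow\quad
r \;\le\; \frac{\kappa\, n}{128\, c_0\,(d_1+d_2)} .
\]
```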


10.2.3 Bounds under operator norm curvature 


In Chapter 9, we also proved a general result (namely, Theorem 9.24) that, for a given
regularizer $\Phi$, provides a bound on the estimation error in terms of the dual norm $\Phi^*$. Recall
from Table 9.1 that the dual to the nuclear norm is the $\ell_2$-operator norm or spectral norm.
For the least-squares cost function, the gradient is given by
\[
\nabla \mathcal{L}_n(\Theta) = -\frac{1}{n} \sum_{i=1}^n X_i \big( y_i - \langle\!\langle X_i, \Theta \rangle\!\rangle \big) = -\frac{1}{n} \mathfrak{X}_n^* \big( y - \mathfrak{X}_n(\Theta) \big),
\]
where $\mathfrak{X}_n^* : \mathbb{R}^n \to \mathbb{R}^{d_1 \times d_2}$ is the adjoint operator. Consequently, in this particular case, the
$\Phi^*$-curvature condition from Definition 9.22 takes the form
\[
||| \tfrac{1}{n} \mathfrak{X}_n^* \mathfrak{X}_n(\Delta) |||_2 \ge \kappa |||\Delta|||_2 - \tau_n |||\Delta|||_{\mathrm{nuc}} \quad \text{for all } \Delta \in \mathbb{R}^{d_1 \times d_2}, \tag{10.20}
\]
where $\kappa > 0$ is the curvature parameter, and $\tau_n \ge 0$ is the tolerance parameter.

Proposition 10.7 Suppose that the observation operator $\mathfrak{X}_n$ satisfies the curvature
condition (10.20) with parameter $\kappa > 0$, and consider a matrix $\Theta^*$ with $\mathrm{rank}(\Theta^*) < \frac{\kappa}{64 \tau_n}$.
Then, conditioned on the event $\mathcal{G}(\lambda_n) = \big\{ ||| \frac{1}{n} \mathfrak{X}_n^*(w) |||_2 \le \frac{\lambda_n}{2} \big\}$, any optimal solution to the
M-estimator (10.16) satisfies the bound
\[
|||\widehat{\Theta} - \Theta^*|||_2 \le 3 \sqrt{2}\, \frac{\lambda_n}{\kappa}. \tag{10.21}
\]

Remark: Note that this bound is smaller by a factor of $\sqrt{r}$ than the Frobenius norm
bound (10.19) that follows from Proposition 10.6. Such a scaling is to be expected, since
the Frobenius norm of a rank-$r$ matrix is at most $\sqrt{r}$ times larger than its operator norm.
The operator norm bound (10.21) is, in some sense, stronger than the earlier Frobenius
norm bound. More specifically, in conjunction with the cone-like inequality (10.15),
inequality (10.21) implies a bound of the form (10.19). See Exercise 10.5 for verification of
these properties.


Proof In order to apply Theorem 9.24, the only remaining condition to verify is the
inequality $\tau_n \Psi^2(\overline{\mathbb{M}}) < \frac{\kappa}{32}$. We have previously calculated that $\Psi^2(\overline{\mathbb{M}}) \le 2r$, so that the stated
upper bound on $r$ ensures that this inequality holds.


10.3 Matrix compressed sensing 


Thus far, we have derived some general results on least squares with nuclear norm
regularization, which apply to any model that satisfies the restricted convexity or curvature
conditions. We now turn to consequences of these general results for more specific observation
models that arise in particular applications. Let us begin this exploration by studying
compressed sensing for low-rank matrices, as introduced previously in Example 10.3. There we
discussed the standard Gaussian observation model, in which the observation matrices $X_i \in
\mathbb{R}^{d_1 \times d_2}$ are drawn i.i.d., with all entries of each observation matrix drawn i.i.d. from the
standard Gaussian $N(0, 1)$ distribution. More generally, one might draw random observation
matrices $X_i$ with dependent entries, for instance with $\mathrm{vec}(X_i) \sim N(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{(d_1 d_2) \times (d_1 d_2)}$
is the covariance matrix. In this case, we say that $X_i$ is drawn from the $\Sigma$-Gaussian ensemble.

In order to apply Proposition 10.6 to this ensemble, our first step is to establish a form of
restricted strong convexity. The following result provides a high-probability lower bound on
the Hessian of the least-squares cost for this ensemble. It involves the quantity
\[
\rho^2(\Sigma) := \sup_{\|u\|_2 = \|v\|_2 = 1} \mathrm{var}\big( \langle\!\langle X, u v^{\mathsf{T}} \rangle\!\rangle \big).
\]
Note that $\rho^2(I_d) = 1$ for the special case of the identity ensemble.

Theorem 10.8 Given $n$ i.i.d. draws $\{X_i\}_{i=1}^n$ of random matrices from the $\Sigma$-Gaussian
ensemble, there are positive constants $c_1 < 1 < c_2$ such that
\[
\frac{\|\mathfrak{X}_n(\Delta)\|_2^2}{n} \ge c_1 \|\sqrt{\Sigma}\, \mathrm{vec}(\Delta)\|_2^2 - c_2\, \rho^2(\Sigma) \Big\{ \frac{d_1 + d_2}{n} \Big\} |||\Delta|||_{\mathrm{nuc}}^2 \quad \text{for all } \Delta \in \mathbb{R}^{d_1 \times d_2} \tag{10.22}
\]
with probability at least $1 - \frac{e^{-n/32}}{1 - e^{-n/32}}$.

This result can be understood as a variant of Theorem 7.16, which established a similar result
for the case of sparse vectors and the $\ell_1$-norm. As with this earlier theorem, Theorem 10.8
can be proved using the Gordon–Slepian comparison lemma for Gaussian processes. In
Exercise 10.6, we work through a proof of a slightly simpler form of the bound.


Theorem 10.8 has an immediate corollary for the noiseless observation model, in which
we observe $(y_i, X_i)$ pairs linked by the linear equation $y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle$. In this setting, the
natural analog of the basis pursuit program from Chapter 7 is the following convex program:
\[
\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} |||\Theta|||_{\mathrm{nuc}} \quad \text{such that } \langle\!\langle X_i, \Theta \rangle\!\rangle = y_i \text{ for all } i = 1,\ldots,n. \tag{10.23}
\]
That is, we search over the space of matrices that match the observations perfectly to find the
solution with minimal nuclear norm. As with the estimator (10.16), it can be reformulated
as an instance of a semidefinite program.


Corollary 10.9 Given $n > 16 \frac{c_2}{c_1} \frac{\rho^2(\Sigma)}{\gamma_{\min}(\Sigma)}\, r\, (d_1 + d_2)$ i.i.d. samples from the $\Sigma$-ensemble, the
estimator (10.23) recovers the rank-$r$ matrix $\Theta^*$ exactly (i.e., it has a unique solution
$\widehat{\Theta} = \Theta^*$) with probability at least $1 - \frac{e^{-n/32}}{1 - e^{-n/32}}$.

The requirement that the sample size $n$ is larger than $r (d_1 + d_2)$ is intuitively reasonable, as
can be seen by counting the degrees of freedom required to specify a rank-$r$ matrix of size
$d_1 \times d_2$. Roughly speaking, we need $r$ numbers to specify its singular values, and $r d_1$ and
$r d_2$ numbers to specify its left and right singular vectors.² Putting together the pieces, we
conclude that the matrix has of the order $r(d_1 + d_2)$ degrees of freedom, consistent with the
corollary. Let us now turn to its proof.


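Proof Since $\widehat{\Theta}$ and $\Theta^*$ are optimal and feasible, respectively, for the program (10.23), we
have $|||\widehat{\Theta}|||_{\mathrm{nuc}} \le |||\Theta^*|||_{\mathrm{nuc}} = |||\Theta^*_{\mathbb{M}}|||_{\mathrm{nuc}}$. Introducing the error matrix $\widehat{\Delta} = \widehat{\Theta} - \Theta^*$, we have
\[
|||\widehat{\Theta}|||_{\mathrm{nuc}} = |||\Theta^* + \widehat{\Delta}|||_{\mathrm{nuc}} = |||\Theta^*_{\mathbb{M}} + \widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}} + \widehat{\Delta}_{\overline{\mathbb{M}}}|||_{\mathrm{nuc}} \overset{(i)}{\ge} |||\Theta^*_{\mathbb{M}} + \widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}|||_{\mathrm{nuc}} - |||\widehat{\Delta}_{\overline{\mathbb{M}}}|||_{\mathrm{nuc}}
\]
by the triangle inequality. Applying decomposability, this yields $|||\Theta^*_{\mathbb{M}} + \widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}|||_{\mathrm{nuc}} =
|||\Theta^*_{\mathbb{M}}|||_{\mathrm{nuc}} + |||\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}|||_{\mathrm{nuc}}$. Combining the pieces, we find that $|||\widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}}|||_{\mathrm{nuc}} \le |||\widehat{\Delta}_{\overline{\mathbb{M}}}|||_{\mathrm{nuc}}$. From the
representation (10.14), any matrix in $\overline{\mathbb{M}}$ has rank at most $2r$, whence
\[
|||\widehat{\Delta}|||_{\mathrm{nuc}} \le 2 |||\widehat{\Delta}_{\overline{\mathbb{M}}}|||_{\mathrm{nuc}} \le 2 \sqrt{2r}\, |||\widehat{\Delta}|||_F. \tag{10.24}
\]
Now let us condition on the event that the lower bound (10.22) holds. When applied to $\widehat{\Delta}$,
and coupled with the inequality (10.24), we find that
\[
\frac{\|\mathfrak{X}_n(\widehat{\Delta})\|_2^2}{n} \ge \Big\{ c_1 \gamma_{\min}(\Sigma) - 8\, c_2\, \rho^2(\Sigma)\, \frac{r (d_1 + d_2)}{n} \Big\} |||\widehat{\Delta}|||_F^2 \ge \frac{c_1}{2} \gamma_{\min}(\Sigma)\, |||\widehat{\Delta}|||_F^2,
\]
where the final inequality follows by applying the given lower bound on $n$, and performing
some algebra. But since both $\widehat{\Theta}$ and $\Theta^*$ are feasible for the convex program (10.23), we have
shown that $0 = \frac{\|\mathfrak{X}_n(\widehat{\Delta})\|_2^2}{n} \ge \frac{c_1}{2} \gamma_{\min}(\Sigma)\, |||\widehat{\Delta}|||_F^2$, which implies that $\widehat{\Delta} = 0$ as claimed.

² The orthonormality constraints for the singular vectors reduce the degrees of freedom, so we have just given
an upper bound here.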
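The algebra behind the final inequality of this proof is short enough to spell out. The following worked step assumes only the sample-size condition of Corollary 10.9, the bound (10.24), and the fact that $\|\sqrt{\Sigma}\,\mathrm{vec}(\Delta)\|_2^2 \ge \gamma_{\min}(\Sigma)\, |||\Delta|||_F^2$.

```latex
\[
|||\widehat{\Delta}|||_{\mathrm{nuc}}^2 \;\le\; \big(2\sqrt{2r}\big)^2\, |||\widehat{\Delta}|||_F^2 \;=\; 8r\, |||\widehat{\Delta}|||_F^2 ,
\]
\[
n > 16\,\frac{c_2}{c_1}\,\frac{\rho^2(\Sigma)}{\gamma_{\min}(\Sigma)}\; r\,(d_1+d_2)
\quad\Longrightarrow\quad
8\,c_2\,\rho^2(\Sigma)\,\frac{r\,(d_1+d_2)}{n} \;<\; \frac{c_1}{2}\,\gamma_{\min}(\Sigma),
\]
\[
\text{so that}\qquad
c_1\,\gamma_{\min}(\Sigma) - 8\,c_2\,\rho^2(\Sigma)\,\frac{r\,(d_1+d_2)}{n} \;>\; \frac{c_1}{2}\,\gamma_{\min}(\Sigma).
\]
```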


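Theorem 10.8 can also be used to establish bounds for the least-squares estimator (10.16),
based on noisy observations of the form $y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + w_i$. Here we state and prove a result
that is applicable to matrices of rank at most $r$.


Corollary 10.10 Consider $n > 64 \frac{c_2}{c_1} \frac{\rho^2(\Sigma)}{\gamma_{\min}(\Sigma)}\, r\, (d_1 + d_2)$ i.i.d. samples $(y_i, X_i)$ from the
linear matrix regression model, where each $X_i$ is drawn from the $\Sigma$-Gaussian ensemble.
Then any optimal solution to the program (10.16) with $\lambda_n = 10\, \sigma\, \rho(\Sigma) \Big( \sqrt{\frac{d_1 + d_2}{n}} + \delta \Big)$
satisfies the bound
\[
|||\widehat{\Theta} - \Theta^*|||_F^2 \le 125\, \frac{\sigma^2 \rho^2(\Sigma)}{c_1^2\, \gamma_{\min}^2(\Sigma)} \Big\{ \frac{r (d_1 + d_2)}{n} + r \delta^2 \Big\} \tag{10.25}
\]
with probability at least $1 - 2 e^{-2 n \delta^2}$.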


Figure 10.2 provides plots of the behavior predicted by Corollary 10.10. We generated
these plots by simulating matrix regression problems with design matrices $X_i$ chosen from
the standard Gaussian ensemble, and then solved the convex program (10.16) with the choice
of $\lambda_n$ given in Corollary 10.10, and matrices of size $d \times d$, where $d^2 \in \{400, 1600, 6400\}$ and
rank $r = \lceil \sqrt{d} \rceil$. In Figure 10.2(a), we plot the Frobenius norm error $|||\widehat{\Theta} - \Theta^*|||_F$, averaged
over $T = 10$ trials, versus the raw sample size $n$. Each of these error plots tends to zero as the
sample size increases, showing the classical consistency of the method. However, the curves
shift to the right as the matrix dimension $d$ (and hence the rank $r$) is increased, showing the
effect of dimensionality. Assuming that the scaling of Corollary 10.10 is sharp, it predicts
that, if we plot the same Frobenius errors versus the rescaled sample size $\frac{n}{r d}$, then all three
curves should be relatively well aligned. These rescaled curves are shown in Figure 10.2(b):
consistent with the prediction of Corollary 10.10, they are now all relatively well aligned,
independently of the dimension and rank.
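To make the rescaling concrete: for square $d \times d$ matrices, the bound of Corollary 10.10 depends on $n$ only through $r(d_1 + d_2)/n = 2rd/n$, so matching the ratio $n/(rd)$ across problem sizes should match the predicted error. The sketch below tabulates the raw sample sizes that correspond to one common rescaled value; the helper names and the target value 4 are illustrative assumptions, and the matrix sizes are those quoted in the caption of Figure 10.2.

```python
import math

def rank_for_dimension(d: int) -> int:
    """Rank used in the simulations: r = ceil(sqrt(d))."""
    return math.ceil(math.sqrt(d))

def rescaled_sample_size(n: int, d: int) -> float:
    """Rescaled sample size n / (r d) for a d x d matrix of rank r = ceil(sqrt(d))."""
    return n / (rank_for_dimension(d) * d)

def raw_sample_size(alpha: float, d: int) -> int:
    """Smallest n whose rescaled sample size is at least alpha."""
    return math.ceil(alpha * rank_for_dimension(d) * d)

for d in (40, 80, 160):
    n = raw_sample_size(4.0, d)
    print(f"d = {d:3d}, r = {rank_for_dimension(d):2d}, n = {n:5d}, "
          f"n / (r d) = {rescaled_sample_size(n, d):.2f}")
```

The raw sample size needed to reach the same rescaled value grows quickly with $d$, which is the rightward shift visible in panel (a).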


Figure 10.2 Plots of the Frobenius norm error $|||\widehat{\Theta} - \Theta^*|||_F$ for the nuclear norm
regularized least-squares estimator (10.16) with design matrices $X_i$ drawn from the
standard Gaussian ensemble. (a) Plots of Frobenius norm error versus sample size $n$
for three different matrix sizes $d \in \{40, 80, 160\}$ and rank $r = \lceil \sqrt{d} \rceil$. (b) Same error
measurements now plotted against the rescaled sample size $\frac{n}{r d}$. As predicted by the
theory, all three curves are now relatively well-aligned.

Let us now turn to the proof of Corollary 10.10.


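Proof We prove the bound (10.25) via an application of Proposition 10.6, in particular
in the form of the bound (10.19). Theorem 10.8 shows that the RSC condition holds with
$\kappa = c_1 \gamma_{\min}(\Sigma)$ and $c_0 = \frac{c_2 \rho^2(\Sigma)}{2}$, so that the stated lower bound on the sample size ensures that
Proposition 10.6 can be applied with $r = \mathrm{rank}(\Theta^*)$.

It remains to verify that the event $\mathcal{G}(\lambda_n) = \big\{ ||| \frac{1}{n} \sum_{i=1}^n w_i X_i |||_2 \le \frac{\lambda_n}{2} \big\}$ holds with high
probability. Introduce the shorthand $Q = \frac{1}{n} \sum_{i=1}^n w_i X_i$, and define the event $\mathcal{E} = \big\{ \frac{\|w\|_2^2}{n} \le 2 \sigma^2 \big\}$. We
then have
\[
\mathbb{P}\Big[ |||Q|||_2 \ge \frac{\lambda_n}{2} \Big] \le \mathbb{P}[\mathcal{E}^c] + \mathbb{P}\Big[ |||Q|||_2 \ge \frac{\lambda_n}{2} \;\Big|\; \mathcal{E} \Big].
\]
Since the noise variables $\{w_i\}_{i=1}^n$ are i.i.d., each zero-mean and sub-Gaussian with parameter
$\sigma$, we have $\mathbb{P}[\mathcal{E}^c] \le e^{-n/8}$. It remains to upper bound the second term, which uses
conditioning on $\mathcal{E}$.

Let $\{u^1,\ldots,u^M\}$ and $\{v^1,\ldots,v^N\}$ be $1/4$-covers in Euclidean norm of the spheres $\mathbb{S}^{d_1 - 1}$
and $\mathbb{S}^{d_2 - 1}$, respectively. By Lemma 5.7, we can find such covers with $M \le 9^{d_1}$ and $N \le 9^{d_2}$
elements respectively. For any $v \in \mathbb{S}^{d_2 - 1}$, we can write $v = v^{\ell} + \Delta$ for some vector $\Delta$ with $\ell_2$-norm
at most $1/4$, and hence
\[
|||Q|||_2 = \sup_{v \in \mathbb{S}^{d_2 - 1}} \|Q v\|_2 \le \frac{1}{4} |||Q|||_2 + \max_{\ell = 1,\ldots,N} \|Q v^{\ell}\|_2.
\]
A similar argument involving the cover of $\mathbb{S}^{d_1 - 1}$ yields $\|Q v^{\ell}\|_2 \le \frac{1}{4} |||Q|||_2 + \max_{j = 1,\ldots,M} \langle u^j, Q v^{\ell} \rangle$.
Thus, we have established that
\[
|||Q|||_2 \le 2 \max_{j = 1,\ldots,M} \; \max_{\ell = 1,\ldots,N} |Z^{j,\ell}| \quad \text{where } Z^{j,\ell} = \langle u^j, Q v^{\ell} \rangle.
\]
Fix some index pair $(j, \ell)$: we can then write $Z^{j,\ell} = \frac{1}{n} \sum_{i=1}^n w_i Y_i^{j,\ell}$ where $Y_i^{j,\ell} = \langle u^j, X_i v^{\ell} \rangle$.
Note that each variable $Y_i^{j,\ell}$ is zero-mean Gaussian with variance at most $\rho^2(\Sigma)$.
Consequently, the variable $Z^{j,\ell}$ is zero-mean Gaussian with variance at most $\frac{2 \sigma^2 \rho^2(\Sigma)}{n}$, where we
have used the conditioning on event $\mathcal{E}$. Putting together the pieces, we conclude that
\[
\mathbb{P}\Big[ ||| \tfrac{1}{n} \textstyle\sum_{i=1}^n w_i X_i |||_2 \ge \frac{\lambda_n}{2} \;\Big|\; \mathcal{E} \Big]
\le \sum_{j=1}^M \sum_{\ell=1}^N \mathbb{P}\Big[ |Z^{j,\ell}| \ge \frac{\lambda_n}{4} \Big]
\le 2 \exp\Big( -\frac{n \lambda_n^2}{64\, \sigma^2 \rho^2(\Sigma)} + \log M + \log N \Big)
\le 2 \exp\Big( -\frac{n \lambda_n^2}{64\, \sigma^2 \rho^2(\Sigma)} + (d_1 + d_2) \log 9 \Big).
\]
Setting $\lambda_n = 10\, \sigma\, \rho(\Sigma) \Big( \sqrt{\frac{d_1 + d_2}{n}} + \delta \Big)$, we find that $\mathbb{P}\big[ ||| \frac{1}{n} \sum_{i=1}^n w_i X_i |||_2 \ge \frac{\lambda_n}{2} \big] \le 2 e^{-2 n \delta^2}$ as
claimed.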
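The event controlled in this proof can be examined by simulation. The sketch below draws repeated instances of the noise-weighted average of standard Gaussian design matrices and compares its operator norm with half the regularization level used in Corollary 10.10. All numerical settings (dimensions, noise level, the value of the slack parameter, number of trials) are illustrative assumptions, and the value 1 is used for the ensemble constant, as appropriate for the identity ensemble.

```python
import numpy as np

rng = np.random.default_rng(7)
d1, d2, n = 10, 10, 400
sigma, delta, rho = 1.0, 0.1, 1.0                 # rho = 1 for the standard Gaussian ensemble

# Regularization level of the form used in Corollary 10.10.
lam = 10 * sigma * rho * (np.sqrt((d1 + d2) / n) + delta)

trials = 50
norms = []
for _ in range(trials):
    X = rng.standard_normal((n, d1, d2))          # observation matrices X_i
    w = sigma * rng.standard_normal(n)            # Gaussian noise (a sub-Gaussian example)
    Q = np.einsum("i,ijk->jk", w, X) / n          # Q = (1/n) sum_i w_i X_i
    norms.append(np.linalg.norm(Q, 2))            # operator (spectral) norm

print("largest operator norm of Q over trials:", max(norms))
print("threshold lambda_n / 2:", lam / 2)
print("fraction of trials violating the event:", np.mean(np.array(norms) > lam / 2))
```

In runs of this kind the operator norm sits well below the threshold, which reflects the conservative constants in the theoretical choice of the regularization parameter.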


Corollary 10.10 is stated for matrices that are exactly low-rank. However, Proposition 10.6
can also be used to derive error bounds for matrices that are not exactly low-rank, but rather
well approximated by a low-rank matrix. For instance, suppose that $\Theta^*$ belongs to the $\ell_q$-“ball”
of matrices given by
\[
\mathbb{B}_q(R_q) := \Big\{ \Theta \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; \sum_{j=1}^{d'} \big( \sigma_j(\Theta) \big)^q \le R_q \Big\}, \tag{10.26}
\]
where $q \in [0, 1]$ is a parameter, and $R_q$ is the radius. Note that this is simply the set of matrices
whose vector of singular values belongs to the usual $\ell_q$-ball for vectors. See Figure 9.5
for an illustration.
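Membership in the set (10.26) depends only on the singular values, which the following sketch makes explicit. The function names and the numerical tolerance used to count non-zero singular values when $q = 0$ are illustrative assumptions.

```python
import numpy as np

def singular_value_q_sum(theta: np.ndarray, q: float, tol: float = 1e-10) -> float:
    """Sum of q-th powers of the singular values; for q = 0 this counts the
    singular values exceeding tol, i.e. the numerical rank."""
    s = np.linalg.svd(theta, compute_uv=False)
    if q == 0:
        return float(np.sum(s > tol))
    return float(np.sum(s ** q))

def in_lq_ball(theta: np.ndarray, q: float, radius: float) -> bool:
    """Check membership in the set (10.26) with parameter q and the given radius."""
    return singular_value_q_sum(theta, q) <= radius

theta = np.diag([4.0, 1.0, 0.0])                  # singular values 4, 1, 0
print("q = 0   :", singular_value_q_sum(theta, 0))     # rank
print("q = 0.5 :", singular_value_q_sum(theta, 0.5))   # sqrt(4) + sqrt(1)
print("q = 1   :", singular_value_q_sum(theta, 1))     # nuclear norm
print("in ball with q = 0, radius 2:", in_lq_ball(theta, 0, 2))
```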

When the unknown matrix $\Theta^*$ belongs to $\mathbb{B}_q(R_q)$, Proposition 10.6 can be used to show
that the estimator (10.16) satisfies an error bound of the form
\[
|||\widehat{\Theta} - \Theta^*|||_F^2 \precsim R_q \Big( \frac{\sigma^2 (d_1 + d_2)}{n} \Big)^{1 - \frac{q}{2}} \tag{10.27}
\]
with high probability. Note that this bound generalizes Corollary 10.10, since in the special
case $q = 0$, the set $\mathbb{B}_0(r)$ corresponds to the set of matrices with rank at most $r$. See
Exercise 10.7 for more details.


As another extension, one can move beyond the setting of least squares, and consider
more general non-quadratic cost functions. As an initial example, still in the context of
matrix regression with samples $z = (X, y)$, let us consider a cost function that satisfies a
local $L$-Lipschitz condition of the form
\[
\big| \mathcal{L}(\Theta; z) - \mathcal{L}(\widetilde{\Theta}; z) \big| \le L\, \big| \langle\!\langle \Theta, X \rangle\!\rangle - \langle\!\langle \widetilde{\Theta}, X \rangle\!\rangle \big| \quad \text{for all } \Theta, \widetilde{\Theta} \in \mathbb{B}_F(R).
\]
For instance, if the response variables $y$ were binary-valued, with the conditional distribution
of the logistic form, as described in Example 9.2, then the log-likelihood would satisfy this
condition with $L = 2$ (see Example 9.33). Similarly, in classification problems based on
matrix-valued observations, the hinge loss that underlies the support vector machine would
also satisfy this condition. In the following example, we show how Theorem 9.34 can be used
to establish restricted strong convexity with respect to the nuclear norm for such Lipschitz
losses.


Example 10.11 (Lipschitz losses and nuclear norm) As a generalization of Corollary 10.10,
suppose that the $d_1 \times d_2$ design matrices $\{X_i\}_{i=1}^n$ are generated i.i.d. from a $\nu$-sub-Gaussian
ensemble, by which we mean that, for each pair of unit-norm vectors $(u, v)$, the random
variable $\langle u, X_i v \rangle$ is zero-mean and $\nu$-sub-Gaussian. Note that the $\Sigma$-Gaussian ensemble is a
special case with $\nu = \rho(\Sigma)$.

Now recall that

$$\mathcal{E}_n(\Delta) := \mathcal{L}_n(\Theta^* + \Delta) - \mathcal{L}_n(\Theta^*) - \langle\langle \nabla \mathcal{L}_n(\Theta^*), \Delta \rangle\rangle$$

denotes the error in the first-order Taylor-series expansion of the empirical cost function,
whereas $\bar{\mathcal{E}}(\Delta)$ denotes the analogous quantity for the population cost function. We claim that
for any $\epsilon > 0$, any cost function that is $L$-Lipschitz over the ball $\mathbb{B}_F(1)$ satisfies the bound

$$\big| \mathcal{E}_n(\Delta) - \bar{\mathcal{E}}(\Delta) \big| \le 16 L \nu \, |||\Delta|||_{nuc} \Big\{ 12 \sqrt{\frac{d_1 + d_2}{n}} + \epsilon \Big\} \quad \text{for all } \Delta \in \mathbb{B}_F(1/d, 1) \tag{10.28}$$

with probability at least $1 - 4 (\log d)^2 e^{-\frac{n \epsilon^2}{16}}$.
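In order to establish the bound (10.28), we need to verify the conditions of Theorem 9.34.
For a matrix $\Theta \in \mathbb{R}^{d_1 \times d_2}$, recall that we use $\{\sigma_j(\Theta)\}$ to denote its singular values. The dual to
the nuclear norm $|||\Theta|||_{nuc} = \sum_j \sigma_j(\Theta)$ is the $\ell_2$-operator norm $|||\Theta|||_2 = \max_j \sigma_j(\Theta)$. Consequently,
we need to study the deviations of the random variable $|||\frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i|||_2$, where $\{\varepsilon_i\}_{i=1}^n$ is an
i.i.d. sequence of Rademacher variables. Since the random matrices $\{X_i\}_{i=1}^n$ are i.i.d., this
random variable has the same distribution as $|||V|||_2$, where $V$ is a $\nu/\sqrt{n}$-sub-Gaussian random
matrix. By the same discretization argument used in the proof of Corollary 10.10, for each
$\lambda > 0$, we have $\mathbb{E}[e^{\lambda |||V|||_2}] \le \sum_{j=1}^M \sum_{k=1}^N \mathbb{E}[e^{2 \lambda Z^{j,k}}]$, where $M \le 9^{d_1}$ and $N \le 9^{d_2}$, and each
random variable $Z^{j,k}$ is sub-Gaussian with parameter at most $\sqrt{2}\nu/\sqrt{n}$. Consequently, for
any $\delta > 0$,

$$\inf_{\lambda > 0} \mathbb{E}\big[ e^{\lambda (|||V|||_2 - \delta)} \big] \le M N \inf_{\lambda > 0} e^{\frac{4 \nu^2 \lambda^2}{n} - \lambda \delta} \le e^{-\frac{n \delta^2}{16 \nu^2} + 9 (d_1 + d_2)}.$$

Setting $\delta^2 = 144 \nu^2 \frac{d_1 + d_2}{n} + \nu^2 \epsilon^2$ yields the claim (10.28). ♣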


10.4 Bounds for phase retrieval 


We now return to the problem of phase retrieval. In the idealized model previously
introduced in Example 10.4, we make $n$ observations of the form $\tilde{y}_i = |\langle x_i, \theta^* \rangle|$, where the
observation vectors $x_i \sim N(0, I_d)$ are drawn independently. A standard lifting procedure leads to
the semidefinite relaxation

$$\hat{\Theta} \in \arg\min_{\Theta \in \mathcal{S}_+^{d \times d}} \operatorname{trace}(\Theta) \quad \text{such that } \tilde{y}_i^2 = \langle\langle \Theta, x_i \otimes x_i \rangle\rangle \text{ for all } i = 1, \ldots, n. \tag{10.29}$$

This optimization problem is known as a semidefinite program (SDP), since it involves
optimizing over the cone $\mathcal{S}_+^{d \times d}$ of positive semidefinite matrices. By construction, the rank-one
matrix $\Theta^* = \theta^* \otimes \theta^*$ is feasible for the optimization problem (10.29), and our goal is to
understand when it is the unique optimal solution. Equivalently, our goal is to show that the
error matrix $\hat{\Delta} = \hat{\Theta} - \Theta^*$ is equal to zero.


Defining the new response variables $y_i = \tilde{y}_i^2$ and observation matrices $X_i := x_i \otimes x_i$, the
constraints in the SDP (10.29) can be written in the equivalent trace inner product form
$y_i = \langle\langle X_i, \Theta \rangle\rangle$. Since both $\hat{\Theta}$ and $\Theta^*$ are feasible and hence must satisfy these constraints, we
see that the error matrix $\hat{\Delta}$ must belong to the nullspace of the linear operator $\mathfrak{X}_n : \mathbb{R}^{d \times d} \to \mathbb{R}^n$
with components $[\mathfrak{X}_n(\Theta)]_i = \langle\langle X_i, \Theta \rangle\rangle$. The following theorem shows that this random
operator satisfies a version of the restricted nullspace property (recall Chapter 7):
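The lifting step can be checked numerically on a toy instance: the squared magnitude $|\langle x, \theta^* \rangle|^2$ coincides with the trace inner product between $x \otimes x$ and $\theta^* \otimes \theta^*$. The sketch below uses plain Python lists; the vectors and the helper names `outer` and `trace_inner` are hypothetical and serve only as an illustration.

```python
def outer(u, v):
    """Outer product u (x) v as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def trace_inner(a, b):
    """Trace inner product <<A, B>> = sum_{j,k} A_jk * B_jk."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


theta_star = [1.0, -2.0, 0.5]   # hypothetical signal
x = [0.3, 1.2, -0.7]            # hypothetical measurement vector

y_tilde = abs(sum(xi * ti for xi, ti in zip(x, theta_star)))  # |<x, theta*>|
y = y_tilde ** 2                                              # squared response

big_theta = outer(theta_star, theta_star)   # rank-one lifted matrix
big_x = outer(x, x)                         # lifted observation matrix
lifted = trace_inner(big_x, big_theta)

print(y, lifted)                                  # agree up to rounding
print(sum(big_theta[j][j] for j in range(3)))     # trace = squared norm of theta*
```

Since the trace of $\theta^* \otimes \theta^*$ equals $\|\theta^*\|_2^2$, minimizing the trace over the feasible set acts as a convex surrogate for seeking a low-rank solution.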


Theorem 10.12 (Restricted nullspace/eigenvalues for phase retrieval) For each $i =
1, \ldots, n$, consider random matrices of the form $X_i = x_i \otimes x_i$ for i.i.d. $N(0, I_d)$ vectors.
Then there are universal constants $(c_0, c_1, c_2)$ such that for any $\rho > 0$, a sample size
$n > c_0 \rho d$ suffices to ensure that

$$\frac{1}{n} \sum_{i=1}^n \langle\langle X_i, \Theta \rangle\rangle^2 \ge \frac{1}{2} |||\Theta|||_F^2 \quad \text{for all matrices such that } |||\Theta|||_{nuc}^2 \le \rho \, |||\Theta|||_F^2 \tag{10.30}$$

with probability at least $1 - c_1 e^{-c_2 n}$.


Note that the lower bound (10.30) implies that there are no non-zero matrices in the intersection
of the nullspace of the operator $\mathfrak{X}_n$ with the matrix cone defined by the inequality
$|||\Theta|||_{nuc}^2 \le \rho \, |||\Theta|||_F^2$.

Consequently, Theorem 10.12 has an immediate corollary for the exactness of the semi- 
definite programming relaxation (10.29): 


Corollary 10.13 Given $n > 2 c_0 d$ samples, the SDP (10.29) has the unique optimal
solution $\hat{\Theta} = \Theta^*$ with probability at least $1 - c_1 e^{-c_2 n}$.


Proof Since $\hat{\Theta}$ and $\Theta^*$ are optimal and feasible (respectively) for the convex program
(10.29), we are guaranteed that $\operatorname{trace}(\hat{\Theta}) \le \operatorname{trace}(\Theta^*)$. Since both matrices must be positive
semidefinite, this trace constraint is equivalent to $|||\hat{\Theta}|||_{nuc} \le |||\Theta^*|||_{nuc}$. This inequality, in
conjunction with the rank-one nature of $\Theta^*$ and the decomposability of the nuclear norm,
implies that the error matrix $\hat{\Delta} = \hat{\Theta} - \Theta^*$ satisfies the cone constraint $|||\hat{\Delta}|||_{nuc} \le \sqrt{2} \, |||\hat{\Delta}|||_F$.
Consequently, we can apply Theorem 10.12 with $\rho = 2$ to conclude that

$$0 = \frac{1}{n} \sum_{i=1}^n \langle\langle X_i, \hat{\Delta} \rangle\rangle^2 \ge \frac{1}{2} |||\hat{\Delta}|||_F^2,$$

from which we conclude that $\hat{\Delta} = 0$ with the claimed probability.


Let us now return to prove Theorem 10.12. 


Proof For each matrix $\Delta \in \mathcal{S}^{d \times d}$, consider the (random) function $f_\Delta(X, v) = v \langle\langle X, \Delta \rangle\rangle$,
where $v \in \{-1, 1\}$ is a Rademacher variable independent of $X$. By construction, we then
have $\mathbb{E}[f_\Delta(X, v)] = 0$. Moreover, as shown in Exercise 10.9, we have

$$\|f_\Delta\|_2^2 = \mathbb{E}\big[ \langle\langle X, \Delta \rangle\rangle^2 \big] = 2 \, |||\Delta|||_F^2 + \big( \operatorname{trace}(\Delta) \big)^2. \tag{10.31a}$$
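For a diagonal matrix $D$, the second moment of $\langle\langle x \otimes x, D \rangle\rangle = \sum_j D_{jj} x_j^2$ can be computed exactly from the Gaussian moments $\mathbb{E}[x_j^4] = 3$ and $\mathbb{E}[x_j^2 x_k^2] = 1$ for $j \ne k$. The sketch below (with hypothetical function names and a hypothetical diagonal) compares that exact computation with the closed form $2 |||D|||_F^2 + (\operatorname{trace}(D))^2$; in particular, the second moment is at least $|||D|||_F^2$, which is all that the subsequent argument uses.

```python
def second_moment_diagonal(diag):
    """E[(sum_j D_jj x_j^2)^2] for x ~ N(0, I), expanded term by term.

    Uses E[x_j^4] = 3 and E[x_j^2 x_k^2] = 1 for j != k.
    """
    total = 0.0
    for j, dj in enumerate(diag):
        for k, dk in enumerate(diag):
            total += dj * dk * (3.0 if j == k else 1.0)
    return total


def closed_form(diag):
    """2 * ||D||_F^2 + trace(D)^2 for a diagonal matrix D."""
    frob_sq = sum(v * v for v in diag)
    trace = sum(diag)
    return 2.0 * frob_sq + trace ** 2


diag = [1.0, -0.5, 2.0]          # hypothetical diagonal entries
print(second_moment_diagonal(diag), closed_form(diag))
```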


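As a consequence, if we define the set $\mathbb{A}_1(\sqrt{\rho}) = \{\Delta \in \mathcal{S}^{d \times d} \mid |||\Delta|||_{nuc} \le \sqrt{\rho} \, |||\Delta|||_F\}$, it suffices
to show that

$$\underbrace{\frac{1}{n} \sum_{i=1}^n \langle\langle X_i, \Delta \rangle\rangle^2}_{\|f_\Delta\|_n^2} \ge \frac{1}{2} \underbrace{\mathbb{E}\big[ \langle\langle X, \Delta \rangle\rangle^2 \big]}_{\|f_\Delta\|_2^2} \quad \text{for all } \Delta \in \mathbb{A}_1(\sqrt{\rho}) \tag{10.31b}$$

with probability at least $1 - c_1 e^{-c_2 n}$.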

We prove claim (10.31b) as a corollary of a more general one-sided uniform law, stated as
Theorem 14.12 in Chapter 14. First, observe that the function class $\mathscr{F} := \{f_\Delta \mid \Delta \in \mathbb{A}_1(\sqrt{\rho})\}$
is a cone, and so star-shaped around zero. Next we claim that the fourth-moment
condition (14.22b) holds. From the result of Exercise 10.9, we can restrict attention to diagonal
matrices without loss of generality. It suffices to show that $\mathbb{E}[f_D^4(X, v)] \le C$ for all matrices
such that $|||D|||_F^2 = \sum_{j=1}^d D_{jj}^2 \le 1$. Since the Gaussian variables have moments of all orders, by
Rosenthal’s inequality (see Exercise 2.20), there is a universal constant $c$ such that

$$\mathbb{E}\big[ f_D^4(X, v) \big] = \mathbb{E}\Big[ \Big( \sum_{j=1}^d D_{jj} x_j^2 \Big)^4 \Big] \le c \Big\{ \sum_{j=1}^d D_{jj}^4 \, \mathbb{E}[x_j^8] + \Big( \sum_{j=1}^d D_{jj}^2 \, \mathbb{E}[x_j^4] \Big)^2 \Big\}.$$

For standard normal variates, we have $\mathbb{E}[x_j^4] = 3$ and $\mathbb{E}[x_j^8] = 105$, whence

$$\mathbb{E}\big[ f_D^4(X, v) \big] \le c \Big\{ 105 \sum_{j=1}^d D_{jj}^4 + 9 \, |||D|||_F^4 \Big\}.$$

Under the condition $\sum_{j=1}^d D_{jj}^2 \le 1$, this quantity is bounded by a universal constant $C$, thereby
verifying the moment condition (14.22b).

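Next, we need to compute the local Rademacher complexity, and hence the critical
radius $\delta_n$. As shown by our previous calculation (10.31a), the condition $\|f_\Delta\|_2 \le \delta$ implies that
$|||\Delta|||_F \le \delta$. Consequently, we have

$$\bar{\mathcal{R}}_n(\delta) \le \mathbb{E}\Big[ \sup_{\substack{\Delta \in \mathbb{A}_1(\sqrt{\rho}) \\ |||\Delta|||_F \le \delta}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_\Delta(X_i; v_i) \Big| \Big],$$

where $\{\varepsilon_i\}_{i=1}^n$ is another i.i.d. Rademacher sequence. Using the definition of $f_\Delta$ and the duality
between the operator and nuclear norms (see Exercise 9.4), we have

$$\bar{\mathcal{R}}_n(\delta) \le \mathbb{E}\Big[ \sup_{\substack{\Delta \in \mathbb{A}_1(\sqrt{\rho}) \\ |||\Delta|||_F \le \delta}} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (x_i \otimes x_i) \Big|\Big|\Big|_2 \, |||\Delta|||_{nuc} \Big] \le \sqrt{\rho} \, \delta \; \mathbb{E}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (x_i \otimes x_i) \Big|\Big|\Big|_2 \Big].$$

Finally, by our previous results on operator norms of random sub-Gaussian matrices (see
Theorem 6.5), there is a constant $c$ such that, in the regime $n > d$, we have

$$\mathbb{E}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (x_i \otimes x_i) \Big|\Big|\Big|_2 \Big] \le c \sqrt{\frac{d}{n}}.$$

Putting together the pieces, we conclude that inequality (14.24) is satisfied for any
$\delta_n \succsim \sqrt{\rho} \sqrt{\frac{d}{n}}$. Consequently, as long as $n > c_0 \rho d$ for a sufficiently large constant $c_0$, we can set
$\delta_n = 1/2$ in Theorem 14.12, which establishes the claim (10.31b).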


10.5 Multivariate regression with low-rank constraints 


The problem of multivariate regression, as previously introduced in Example 10.1, involves
estimating a prediction function, mapping covariate vectors $z \in \mathbb{R}^p$ to output vectors $y \in \mathbb{R}^T$.
In the case of linear prediction, any such mapping can be parameterized by a matrix
$\Theta^* \in \mathbb{R}^{p \times T}$. A collection of $n$ observations can be specified by the model

$$Y = Z \Theta^* + W, \tag{10.32}$$

where $(Y, Z) \in \mathbb{R}^{n \times T} \times \mathbb{R}^{n \times p}$ are observed, and $W \in \mathbb{R}^{n \times T}$ is a matrix of noise variables. For
this observation model, the least-squares cost takes the form $\mathcal{L}_n(\Theta) = \frac{1}{2n} |||Y - Z\Theta|||_F^2$.

The following result is a corollary of Proposition 10.7 in application to this model. It is
applicable to the case of fixed design and so involves the minimum and maximum
eigenvalues of the sample covariance matrix $\hat{\Sigma} := \frac{Z^{\mathrm{T}} Z}{n}$.


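Corollary 10.14 Consider the observation model (10.32) in which $\Theta^* \in \mathbb{R}^{p \times T}$ has
rank at most $r$, and the noise matrix $W$ has i.i.d. entries that are zero-mean and
$\sigma$-sub-Gaussian. Then any solution to the program (10.16) with
$\lambda_n = 10 \sigma \sqrt{\gamma_{\max}(\hat{\Sigma})} \big( \sqrt{\frac{p + T}{n}} + \delta \big)$ satisfies the bound

$$|||\hat{\Theta} - \Theta^*|||_2 \le 30 \sqrt{2} \, \frac{\sigma \sqrt{\gamma_{\max}(\hat{\Sigma})}}{\gamma_{\min}(\hat{\Sigma})} \Big( \sqrt{\frac{p + T}{n}} + \delta \Big) \tag{10.33}$$

with probability at least $1 - 2 e^{-2 n \delta^2}$. Moreover, we have

$$|||\hat{\Theta} - \Theta^*|||_F \le 4 \sqrt{2r} \, |||\hat{\Theta} - \Theta^*|||_2 \quad \text{and} \quad |||\hat{\Theta} - \Theta^*|||_{nuc} \le 32 r \, |||\hat{\Theta} - \Theta^*|||_2. \tag{10.34}$$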


Note that the guarantee (10.33) is meaningful only when $n > p$, since the lower bound
$\gamma_{\min}(\hat{\Sigma}) > 0$ cannot hold otherwise. However, even if the matrix $\Theta^*$ were rank-one, it would
have at least $p + T$ degrees of freedom, so this lower bound is unavoidable.
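To get a feel for the scaling in Corollary 10.14, the following sketch evaluates the regularization level $\lambda_n = 10 \sigma \sqrt{\gamma_{\max}(\hat{\Sigma})} (\sqrt{(p+T)/n} + \delta)$ and the right-hand side of the operator-norm bound (10.33). The function names and the numerical inputs (including the two eigenvalues, which in practice would be computed from the design matrix $Z$) are hypothetical and purely illustrative.

```python
import math


def regularization_level(sigma, gamma_max, p, T, n, delta):
    """lambda_n = 10 * sigma * sqrt(gamma_max) * (sqrt((p + T) / n) + delta)."""
    return 10.0 * sigma * math.sqrt(gamma_max) * (math.sqrt((p + T) / n) + delta)


def operator_norm_bound(sigma, gamma_max, gamma_min, p, T, n, delta):
    """Right-hand side of (10.33); requires gamma_min > 0, hence n >= p."""
    if gamma_min <= 0.0:
        raise ValueError("the bound is vacuous unless gamma_min > 0")
    rate = math.sqrt((p + T) / n) + delta
    return 30.0 * math.sqrt(2.0) * sigma * math.sqrt(gamma_max) / gamma_min * rate


# Hypothetical problem sizes and eigenvalues of the sample covariance.
sigma, gamma_max, gamma_min = 1.0, 4.0, 1.0
p, T, n, delta = 10, 6, 1600, 0.1

lam = regularization_level(sigma, gamma_max, p, T, n, delta)
bound = operator_norm_bound(sigma, gamma_max, gamma_min, p, T, n, delta)
print(lam, bound)
```

With these formulas the bound equals $3\sqrt{2} \, \lambda_n / \gamma_{\min}(\hat{\Sigma})$, so the error scales linearly in the regularization level and inversely in the curvature $\gamma_{\min}(\hat{\Sigma})$.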


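Proof We first claim that condition (10.20) holds with $\kappa = \gamma_{\min}(\hat{\Sigma})$ and $\tau_n = 0$. We have
$\nabla \mathcal{L}_n(\Theta) = \frac{1}{n} Z^{\mathrm{T}} (Z\Theta - Y)$, and hence $\nabla \mathcal{L}_n(\Theta^* + \Delta) - \nabla \mathcal{L}_n(\Theta^*) = \hat{\Sigma} \Delta$, where $\hat{\Sigma} = \frac{Z^{\mathrm{T}} Z}{n}$ is the
sample covariance. Thus, it suffices to show that

$$|||\hat{\Sigma} \Delta|||_2 \ge \gamma_{\min}(\hat{\Sigma}) \, |||\Delta|||_2 \quad \text{for all } \Delta \in \mathbb{R}^{p \times T}.$$

For any vector $u \in \mathbb{R}^T$, we have $\|\hat{\Sigma} \Delta u\|_2 \ge \gamma_{\min}(\hat{\Sigma}) \|\Delta u\|_2$, and thus

$$|||\hat{\Sigma} \Delta|||_2 = \sup_{\|u\|_2 = 1} \|\hat{\Sigma} \Delta u\|_2 \ge \gamma_{\min}(\hat{\Sigma}) \sup_{\|u\|_2 = 1} \|\Delta u\|_2 = \gamma_{\min}(\hat{\Sigma}) \, |||\Delta|||_2,$$

which establishes the claim.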

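It remains to verify that the inequality $|||\nabla \mathcal{L}_n(\Theta^*)|||_2 \le \frac{\lambda_n}{2}$ holds with high probability under
the stated choice of $\lambda_n$. For this model, we have $\nabla \mathcal{L}_n(\Theta^*) = -\frac{1}{n} Z^{\mathrm{T}} W$, where $W \in \mathbb{R}^{n \times T}$ is a
zero-mean matrix of i.i.d. $\sigma$-sub-Gaussian variates. As shown in Exercise 10.8, we have

$$\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} Z^{\mathrm{T}} W \Big|\Big|\Big|_2 \ge 5 \sigma \sqrt{\gamma_{\max}(\hat{\Sigma})} \Big( \sqrt{\frac{p + T}{n}} + \delta \Big) \Big] \le 2 e^{-2 n \delta^2}, \tag{10.35}$$

from which the validity of $\lambda_n$ follows. Thus, the bound (10.33) follows from Proposition 10.7.

Turning to the remaining bounds (10.34), with the given choice of $\lambda_n$, the cone
inequality (10.15) guarantees that $|||\hat{\Delta}_{\bar{\mathbb{M}}^\perp}|||_{nuc} \le 3 \, |||\hat{\Delta}_{\bar{\mathbb{M}}}|||_{nuc}$. Since any matrix in $\bar{\mathbb{M}}$ has rank at most
$2r$, we conclude that $|||\hat{\Delta}|||_{nuc} \le 4 \sqrt{2r} \, |||\hat{\Delta}|||_F$. Consequently, the nuclear norm bound in
equation (10.34) follows from the Frobenius norm bound. We have

$$|||\hat{\Delta}|||_F^2 = \langle\langle \hat{\Delta}, \hat{\Delta} \rangle\rangle \overset{(i)}{\le} |||\hat{\Delta}|||_{nuc} \, |||\hat{\Delta}|||_2 \overset{(ii)}{\le} 4 \sqrt{2r} \, |||\hat{\Delta}|||_F \, |||\hat{\Delta}|||_2,$$

where step (i) follows from Hölder’s inequality, and step (ii) follows from our previous
bound. Canceling out a factor of $|||\hat{\Delta}|||_F$ from both sides yields the Frobenius norm bound in
equation (10.34), thereby completing the proof.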


10.6 Matrix completion 


Let us now return to analyze the matrix completion problem previously introduced in
Example 10.2. Recall that it corresponds to a particular case of matrix regression: observations
are of the form $y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i$, where $X_i \in \mathbb{R}^{d_1 \times d_2}$ is a sparse mask matrix, zero
everywhere except for a single randomly chosen entry $(a(i), b(i))$, where it is equal to $\sqrt{d_1 d_2}$. The
sparsity of these regression matrices introduces some subtlety into the analysis of the matrix
completion problem, as will become clear in the analysis to follow.

Let us now clarify why we chose to use rescaled mask matrices $X_i$, that is, equal to
$\sqrt{d_1 d_2}$ instead of 1 in their unique non-zero entry. With this choice, we have the convenient
relation

$$\mathbb{E}\Big[ \frac{\|\mathfrak{X}_n(\Theta^*)\|_2^2}{n} \Big] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big[ \langle\langle X_i, \Theta^* \rangle\rangle^2 \big] = |||\Theta^*|||_F^2, \tag{10.36}$$

using the fact that each entry of $\Theta^*$ is picked out with probability $(d_1 d_2)^{-1}$.
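The relation (10.36) can be verified exactly on a small matrix by enumerating all $d_1 d_2$ equally likely positions of the non-zero entry of a rescaled mask. The function names and the test matrix below are hypothetical, used only for illustration.

```python
import math


def rescaled_mask_second_moment(theta):
    """E[<<X, Theta>>^2] for a rescaled mask X, by enumeration over its position.

    Each entry (a, b) is selected with probability 1 / (d1 * d2), in which case
    <<X, Theta>> = sqrt(d1 * d2) * Theta[a][b].
    """
    d1, d2 = len(theta), len(theta[0])
    scale = math.sqrt(d1 * d2)
    prob = 1.0 / (d1 * d2)
    return sum(prob * (scale * theta[a][b]) ** 2
               for a in range(d1) for b in range(d2))


def frobenius_sq(theta):
    """Squared Frobenius norm."""
    return sum(v * v for row in theta for v in row)


theta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # hypothetical 2 x 3 matrix
print(rescaled_mask_second_moment(theta), frobenius_sq(theta))
```

Without the rescaling by $\sqrt{d_1 d_2}$, the same expectation would be smaller by a factor of $d_1 d_2$, which is why the rescaled masks are the convenient normalization.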

The calculation (10.36) shows that, for any unit-norm matrix $\Theta^*$, the squared Euclidean
norm of $\|\mathfrak{X}_n(\Theta^*)\|_2 / \sqrt{n}$ has mean one. Nonetheless, in the high-dimensional setting of
interest, namely, when $n \ll d_1 d_2$, there are many non-zero matrices $\Theta^*$ of low rank such that
$\mathfrak{X}_n(\Theta^*) = 0$ with high probability. This phenomenon is illustrated by the following example.

Example 10.15 (Troublesome cases for matrix completion) Consider the matrix

$$\Theta^{\mathrm{bad}} := e_1 \otimes e_1 = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}, \tag{10.37}$$

which is of rank one. Let $\mathfrak{X}_n : \mathbb{R}^{d \times d} \to \mathbb{R}^n$ be the random observation operator based on $n$
i.i.d. draws (with replacement) of rescaled mask matrices $X_i$. As we show in Exercise 10.3,
we have $\mathfrak{X}_n(\Theta^{\mathrm{bad}}) = 0$ with probability converging to one whenever $n = o(d^2)$. ♣
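The claim about the matrix $e_1 \otimes e_1$ follows from a one-line computation: the observation operator returns zero exactly when none of the $n$ uniformly drawn positions hits entry $(1, 1)$, an event of probability $(1 - d^{-2})^n$. The sketch below evaluates this probability; the function name and the numerical values are hypothetical.

```python
def prob_operator_vanishes(d, n):
    """Probability that n uniform draws (with replacement) from the d * d
    entries all miss the single non-zero entry of e_1 (x) e_1."""
    return (1.0 - 1.0 / (d * d)) ** n


d = 100
print(prob_operator_vanishes(d, n=d))            # n much smaller than d^2
print(prob_operator_vanishes(d, n=10 * d * d))   # n a large multiple of d^2
```

When $n$ is of smaller order than $d^2$ the probability is close to one, whereas it becomes negligible only once $n$ is a large multiple of $d^2$.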


Consequently, if we wish to prove non-trivial results about matrix completion in the
regime $n \ll d_1 d_2$, we need to exclude matrices of the form (10.37). One avenue for doing
so is by imposing so-called matrix incoherence conditions directly on the singular vectors
of the unknown matrix $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$. These conditions were first introduced in the context
of numerical linear algebra, in which context they are known as leverage scores (see the
bibliographic section for further discussion). Roughly speaking, conditions on the leverage
scores ensure that the singular vectors of $\Theta^*$ are relatively “spread out”.

More specifically, consider the singular value decomposition $\Theta^* = U D V^{\mathrm{T}}$, where $D$ is a
diagonal matrix of singular values, and the columns of $U$ and $V$ contain the left and right
singular vectors, respectively. What does it mean for the singular vectors to be spread out?
Consider the matrix $U \in \mathbb{R}^{d_1 \times r}$ of left singular vectors. By construction, each of its
$d_1$-dimensional columns is normalized to Euclidean norm one; thus, if each singular vector
were perfectly spread out, then each entry would have magnitude of the order $1/\sqrt{d_1}$. As a
consequence, in this ideal case, each $r$-dimensional row of $U$ would have Euclidean norm
exactly $\sqrt{r/d_1}$. Similarly, the rows of $V$ would have Euclidean norm $\sqrt{r/d_2}$ in the ideal case.

In general, the Euclidean norms of the rows of $U$ and $V$ are known as the left and right
leverage scores of the matrix $\Theta^*$, and matrix incoherence conditions enforce that they are
relatively close to the ideal case. More specifically, note that the matrix $U U^{\mathrm{T}} \in \mathbb{R}^{d_1 \times d_1}$ has
diagonal entries corresponding to the squared left leverage scores, with a similar observation
for the matrix $V V^{\mathrm{T}} \in \mathbb{R}^{d_2 \times d_2}$. Thus, one way in which to control the leverage scores is via
bounds of the form

$$\Big\| U U^{\mathrm{T}} - \frac{r}{d_1} I_{d_1 \times d_1} \Big\|_{\max} \le \mu \frac{\sqrt{r}}{d_1} \quad \text{and} \quad \Big\| V V^{\mathrm{T}} - \frac{r}{d_2} I_{d_2 \times d_2} \Big\|_{\max} \le \mu \frac{\sqrt{r}}{d_2}, \tag{10.38}$$

where $\mu > 0$ is the incoherence parameter. When the unknown matrix $\Theta^*$ satisfies conditions
of this type, it is possible to establish exact recovery results for the noiseless version of the
matrix completion problem. See the bibliographic section for further discussion.

In the more realistic setting of noisy observations, the incoherence conditions (10.38)
have an unusual property, in that they have no dependence on the singular values. In the
presence of noise, one cannot expect to recover the matrix exactly, but rather only an estimate
that captures all “significant” components. Here significance is defined relative to the noise
level. Unfortunately, the incoherence conditions (10.38) are non-robust, and so less suitable
in application to noisy problems. An example is helpful in understanding this issue.


Example 10.16 (Non-robustness of singular vector incoherence) Define the $d$-dimensional
vector $z = \begin{bmatrix} 0 & 1 & 1 & \cdots & 1 \end{bmatrix}$, and the associated matrix $Z^* := (z \otimes z)/d$. By construction,
the matrix $Z^*$ is rank-one, and satisfies the incoherence conditions (10.38) with constant $\mu$.
But now suppose that we “poison” this incoherent matrix with a small multiple of the “bad”
matrix from Example 10.15, in particular forming the matrix

$$\Gamma^* = (1 - \delta) Z^* + \delta \, \Theta^{\mathrm{bad}} \quad \text{for some } \delta \in (0, 1]. \tag{10.39}$$

As long as $\delta > 0$, then the matrix $\Gamma^*$ has $e_1 \in \mathbb{R}^d$ as one of its eigenvectors, and so violates
the incoherence conditions (10.38). But for the non-exact recovery results of interest in a
statistical setting, very small values of $\delta$ need not be a concern, since the component $\delta \, \Theta^{\mathrm{bad}}$
has Frobenius norm $\delta$, and so can be ignored. ♣


There are various ways of addressing this deficiency of the incoherence conditions (10.38).
Possibly the simplest is by bounding the maximum absolute value of the matrix, or rather
in order to preserve the scale of the problem, by bounding the ratio of the maximum value
to its Frobenius norm. More precisely, for any non-zero matrix $\Theta \in \mathbb{R}^{d_1 \times d_2}$, we define the
spikiness ratio

$$\alpha_{\mathrm{sp}}(\Theta) = \frac{\sqrt{d_1 d_2} \, \|\Theta\|_{\max}}{|||\Theta|||_F}, \tag{10.40}$$

where $\|\cdot\|_{\max}$ denotes the elementwise maximum absolute value. By definition of the
Frobenius norm, we have

$$|||\Theta|||_F^2 = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \Theta_{jk}^2 \le d_1 d_2 \, \|\Theta\|_{\max}^2,$$

so that the spikiness ratio is lower bounded by 1. On the other hand, it can also be seen
that $\alpha_{\mathrm{sp}}(\Theta) \le \sqrt{d_1 d_2}$, where this upper bound is achieved (for instance) by the previously
constructed matrix (10.37). Recalling the “poisoned” matrix (10.39), note that unlike the
incoherence condition, its spikiness ratio degrades as $\delta$ increases, but not in an abrupt manner.
In particular, for any 6 € [0, 1], we have a,,(I™) < CA, 
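The spikiness ratio is straightforward to compute. The following sketch evaluates it for a flat matrix, for the rank-one matrix $e_1 \otimes e_1$ of (10.37), and for the poisoned matrix (10.39) at a few values of $\delta$. The function names, the dimension $d = 4$ and the grid of $\delta$ values are hypothetical choices made for illustration.

```python
import math


def spikiness_ratio(theta):
    """sqrt(d1 * d2) * max_jk |Theta_jk| / ||Theta||_F for a non-zero matrix."""
    d1, d2 = len(theta), len(theta[0])
    max_abs = max(abs(v) for row in theta for v in row)
    frob = math.sqrt(sum(v * v for row in theta for v in row))
    if frob == 0.0:
        raise ValueError("spikiness ratio is undefined for the zero matrix")
    return math.sqrt(d1 * d2) * max_abs / frob


def bad_matrix(d):
    """The matrix e_1 (x) e_1: a single one in the top-left corner."""
    return [[1.0 if (j == 0 and k == 0) else 0.0 for k in range(d)]
            for j in range(d)]


def poisoned_matrix(d, delta):
    """(1 - delta) * (z (x) z) / d + delta * e_1 (x) e_1 with z = [0, 1, ..., 1]."""
    z = [0.0] + [1.0] * (d - 1)
    bad = bad_matrix(d)
    return [[(1.0 - delta) * z[j] * z[k] / d + delta * bad[j][k]
             for k in range(d)] for j in range(d)]


d = 4
flat = [[1.0] * d for _ in range(d)]
print(spikiness_ratio(flat), spikiness_ratio(bad_matrix(d)))
for delta in (0.0, 0.1, 0.5, 1.0):
    print(delta, spikiness_ratio(poisoned_matrix(d, delta)))
```

The flat matrix attains the lower bound of one and $e_1 \otimes e_1$ attains the upper bound $\sqrt{d_1 d_2} = d$, while the poisoned matrix moves continuously between moderate values and the maximum as $\delta$ ranges over $[0, 1]$.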

The following theorem establishes a form of restricted strong convexity for the random
operator that underlies matrix completion. To simplify the theorem statement, we adopt the
shorthand $d = d_1 + d_2$.


Theorem 10.17 Let $\mathfrak{X}_n : \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^n$ be the random matrix completion operator
formed by $n$ i.i.d. samples of rescaled mask matrices $X_i$. Then there are universal
positive constants $(c_1, c_2)$ such that

$$\Big| \frac{1}{n} \frac{\|\mathfrak{X}_n(\Theta)\|_2^2}{|||\Theta|||_F^2} - 1 \Big| \le c_1 \, \alpha_{\mathrm{sp}}(\Theta) \frac{|||\Theta|||_{nuc}}{|||\Theta|||_F} \sqrt{\frac{d \log d}{n}} + c_2 \, \alpha_{\mathrm{sp}}^2(\Theta) \Big( \sqrt{\frac{d \log d}{n}} + \delta \Big)^2 \tag{10.41}$$

for all non-zero $\Theta \in \mathbb{R}^{d_1 \times d_2}$, uniformly with probability at least $1 - 2 e^{-\frac{1}{2} d \log d - n \delta}$.


In order to interpret this claim, note that the ratio $\beta(\Theta) := \frac{|||\Theta|||_{nuc}}{|||\Theta|||_F}$ serves as a “weak”
measure of the rank. For any rank-$r$ matrix, we have $\beta(\Theta) \le \sqrt{r}$, but in addition, there are many
other higher-rank matrices that also satisfy this type of bound. On the other hand, recall
the “bad” matrix $\Theta^{\mathrm{bad}}$ from Example 10.15. Although it has rank one, its spikiness ratio
is maximal, that is, $\alpha_{\mathrm{sp}}(\Theta^{\mathrm{bad}}) = d$. Consequently, the bound (10.41) does not provide any
interesting guarantee until $n \gg d^2$. This prediction is consistent with the result of
Exercise 10.3.

Before proving Theorem 10.17, let us state and prove one of its consequences for noisy
matrix completion. Given $n$ i.i.d. samples $\tilde{y}_i$ from the noisy linear model (10.6), consider the
nuclear norm regularized estimator

$$\hat{\Theta} \in \arg\min_{\|\Theta\|_{\max} \le \frac{\alpha}{\sqrt{d_1 d_2}}} \Big\{ \frac{1}{2n} \sum_{i=1}^n d_1 d_2 \big\{ \tilde{y}_i - \Theta_{a(i), b(i)} \big\}^2 + \lambda_n |||\Theta|||_{nuc} \Big\}, \tag{10.42}$$

where Theorem 10.17 motivates the addition of the extra side constraint on the infinity norm
of $\Theta$. As before, we use the shorthand notation $d = d_1 + d_2$.


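Corollary 10.18 Consider the observation model (10.6) for a matrix $\Theta^*$ with rank at
most $r$, elementwise bounded as $\|\Theta^*\|_{\max} \le \alpha / \sqrt{d_1 d_2}$, and i.i.d. additive noise variables
$\{w_i\}_{i=1}^n$ that satisfy the Bernstein condition with parameters $(\sigma, b)$. Given a sample size
$n > \frac{100 b^2}{\sigma^2} d \log d$, if we solve the program (10.42) with $\lambda_n^2 = 25 \frac{\sigma^2 d \log d}{n} + \delta^2$ for some
$\delta \in (0, \frac{\sigma^2}{2b})$, then any optimal solution $\hat{\Theta}$ satisfies the bound

$$|||\hat{\Theta} - \Theta^*|||_F^2 \le c_1 \max\{\sigma^2, \alpha^2\} \, r \Big\{ \frac{d \log d}{n} + \delta^2 \Big\} \tag{10.43}$$

with probability at least $1 - e^{-\frac{n \delta^2}{16 d}} - 2 e^{-\frac{1}{2} d \log d - n}$.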


Remark: Note that the bound (10.43) implies that the squared Frobenius norm is small
as long as (apart from a logarithmic factor) the sample size $n$ is larger than the degrees of
freedom in a rank-$r$ matrix, namely, $r (d_1 + d_2)$.
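The role of the sample-size condition can be checked numerically. Reading the corollary as requiring $n > \frac{100 b^2}{\sigma^2} d \log d$, $\lambda_n^2 = 25 \frac{\sigma^2 d \log d}{n} + \delta^2$ and $\delta < \frac{\sigma^2}{2b}$ (these are the forms assumed in this sketch), the regularization level satisfies $b \lambda_n \le \sigma^2$, which is the condition under which the simplified Bernstein tail bound in the proof applies. The function names and numerical values below are hypothetical.

```python
import math


def min_sample_size(sigma, b, d):
    """Assumed lower bound on n: (100 * b^2 / sigma^2) * d * log(d)."""
    return 100.0 * b ** 2 / sigma ** 2 * d * math.log(d)


def regularization_level(sigma, d, n, delta):
    """Assumed choice: lambda_n = sqrt(25 * sigma^2 * d * log(d) / n + delta^2)."""
    return math.sqrt(25.0 * sigma ** 2 * d * math.log(d) / n + delta ** 2)


# Hypothetical noise parameters and dimensions, with d = d1 + d2.
sigma, b = 1.0, 1.0
d1, d2 = 12, 8
d = d1 + d2
n = 6000
delta = 0.4                     # must lie in (0, sigma^2 / (2 * b))

lam = regularization_level(sigma, d, n, delta)
print(n > min_sample_size(sigma, b, d))   # sample-size condition
print(delta < sigma ** 2 / (2.0 * b))     # admissible delta
print(b * lam <= sigma ** 2)              # condition used in the tail bound
```

Both terms under the square root are below $(\sigma^2 / (2b))^2$ under these conditions, so $\lambda_n$ is at most the sum of their square roots, which is below $\sigma^2 / b$.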


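Proof We first verify that the good event $\mathcal{G}(\lambda_n) = \{ |||\nabla \mathcal{L}_n(\Theta^*)|||_2 \le \frac{\lambda_n}{2} \}$ holds with high
probability. Under the observation model (10.6), the gradient of the least-squares
objective (10.42) is given by

$$\nabla \mathcal{L}_n(\Theta^*) = -\frac{1}{n} \sum_{i=1}^n (d_1 d_2) \frac{w_i}{\sqrt{d_1 d_2}} E_i = -\frac{1}{n} \sum_{i=1}^n w_i X_i,$$

where we recall the rescaled mask matrices $X_i := \sqrt{d_1 d_2} \, E_i$. From our calculations in
Example 6.18, we have³

$$\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n w_i X_i \Big|\Big|\Big|_2 \ge \epsilon \Big] \le 4 d \, e^{-\frac{n \epsilon^2}{8 d (\sigma^2 + b \epsilon)}} \le 4 d \, e^{-\frac{n \epsilon^2}{16 d \sigma^2}},$$

where the second inequality holds for any $\epsilon > 0$ such that $b \epsilon \le \sigma^2$. Under the stated lower
bound on the sample size, we are guaranteed that $b \lambda_n \le \sigma^2$, from which it follows that the
event $\mathcal{G}(\lambda_n)$ holds with the claimed probability.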

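Next we use Theorem 10.17 to verify a variant of the restricted strong convexity condition.
Under the event $\mathcal{G}(\lambda_n)$, Proposition 9.13 implies that the error matrix $\hat{\Delta} = \hat{\Theta} - \Theta^*$ satisfies
the constraint $|||\hat{\Delta}|||_{nuc} \le 4 \, |||\hat{\Delta}_{\bar{\mathbb{M}}}|||_{nuc}$. As noted earlier, any matrix in $\bar{\mathbb{M}}$ has rank at most $2r$,
whence $|||\hat{\Delta}|||_{nuc} \le 4 \sqrt{2r} \, |||\hat{\Delta}|||_F$. By construction, we also have $\|\hat{\Delta}\|_{\max} \le \frac{2\alpha}{\sqrt{d_1 d_2}}$. Putting
together the pieces, Theorem 10.17 implies that, with probability at least $1 - 2 e^{-\frac{1}{2} d \log d - n \delta}$, the
observation operator $\mathfrak{X}_n$ satisfies the lower bound

$$\begin{aligned} \frac{\|\mathfrak{X}_n(\hat{\Delta})\|_2^2}{n} &\ge |||\hat{\Delta}|||_F^2 - 8 \sqrt{2} \, c_1 \alpha \sqrt{\frac{r d \log d}{n}} \, |||\hat{\Delta}|||_F - 4 c_2 \alpha^2 \Big( \sqrt{\frac{d \log d}{n}} + \delta \Big)^2 \\ &\ge |||\hat{\Delta}|||_F \Big\{ |||\hat{\Delta}|||_F - 8 \sqrt{2} \, c_1 \alpha \sqrt{\frac{r d \log d}{n}} \Big\} - 8 c_2 \alpha^2 \Big( \frac{d \log d}{n} + \delta^2 \Big). \end{aligned} \tag{10.44}$$

In order to complete the proof using this bound, we only need to consider two possible
cases.


Case 1: On one hand, if either

$$|||\hat{\Delta}|||_F \le 16 \sqrt{2} \, c_1 \alpha \sqrt{\frac{r d \log d}{n}} \quad \text{or} \quad |||\hat{\Delta}|||_F^2 \le 64 c_2 \alpha^2 \Big( \frac{d \log d}{n} + \delta^2 \Big),$$

then the claim (10.43) follows.


Case 2: Otherwise, we must have

$$|||\hat{\Delta}|||_F - 8 \sqrt{2} \, c_1 \alpha \sqrt{\frac{r d \log d}{n}} > \frac{|||\hat{\Delta}|||_F}{2} \quad \text{and} \quad 8 c_2 \alpha^2 \Big( \frac{d \log d}{n} + \delta^2 \Big) < \frac{|||\hat{\Delta}|||_F^2}{4},$$

and hence the lower bound (10.44) implies that

$$\frac{\|\mathfrak{X}_n(\hat{\Delta})\|_2^2}{n} \ge \frac{1}{2} |||\hat{\Delta}|||_F^2 - \frac{1}{4} |||\hat{\Delta}|||_F^2 = \frac{1}{4} |||\hat{\Delta}|||_F^2.$$

This is the required restricted strong convexity condition, and so the proof is then complete.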


Finally, let us return to prove Theorem 10.17. 


³ Here we have included a factor of 8 (as opposed to 2) in the denominator of the exponent, to account for the
possible need of symmetrizing the random variables $w_i$.

Proof Given the invariance of the inequality to rescaling, we may assume without loss of
generality that $|||\Theta|||_F = 1$. For given positive constants $(\alpha, \rho)$, define the set

$$S(\alpha, \rho) = \Big\{ \Theta \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; |||\Theta|||_F = 1, \; \|\Theta\|_{\max} \le \frac{\alpha}{\sqrt{d_1 d_2}} \text{ and } |||\Theta|||_{nuc} \le \rho \Big\}, \tag{10.45}$$

as well as the associated random variable $Z(\alpha, \rho) := \sup_{\Theta \in S(\alpha, \rho)} \big| \frac{1}{n} \|\mathfrak{X}_n(\Theta)\|_2^2 - 1 \big|$. We begin
by showing that there are universal constants $(c_1, c_2)$ such that

$$\mathbb{P}\Big[ Z(\alpha, \rho) \ge \frac{c_1}{4} \alpha \rho \sqrt{\frac{d \log d}{n}} + \frac{c_2}{4} \Big( \alpha \sqrt{\frac{d \log d}{n}} \Big)^2 \Big] \le e^{-d \log d}. \tag{10.46}$$

Here our choice of the rescaling by 1/4 is for later theoretical convenience. Our proof of this
bound is divided into two steps.

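Concentration around mean: Introducing the convenient shorthand notation
$F_\Theta(X) := \langle\langle \Theta, X \rangle\rangle^2$, we can write

$$Z(\alpha, \rho) = \sup_{\Theta \in S(\alpha, \rho)} \Big| \frac{1}{n} \sum_{i=1}^n F_\Theta(X_i) - \mathbb{E}[F_\Theta(X_i)] \Big|,$$

so that concentration results for empirical processes from Chapter 3 can be applied. In
particular, we will apply the Bernstein-type bound (3.86): in order to do so, we need to bound
$\|F_\Theta\|_{\max}$ and $\operatorname{var}(F_\Theta(X))$ uniformly over the class. On one hand, for any rescaled mask
matrix $X$ and parameter matrix $\Theta \in S(\alpha, \rho)$, we have

$$|F_\Theta(X)| \le \|\Theta\|_{\max}^2 \|X\|_1^2 \le \frac{\alpha^2}{d_1 d_2} \, d_1 d_2 = \alpha^2,$$

where we have used the fact that $\|X\|_1^2 = d_1 d_2$ for any rescaled mask matrix. Turning to the
variance, we have

$$\operatorname{var}(F_\Theta(X)) \le \mathbb{E}[F_\Theta^2(X)] \le \alpha^2 \, \mathbb{E}[F_\Theta(X)] = \alpha^2,$$

a bound which holds for any $\Theta \in S(\alpha, \rho)$. Consequently, applying the bound (3.86) with
$\epsilon = 1$ and $t = d \log d$, we conclude that there are universal constants $(c_1, c_2)$ such that

$$\mathbb{P}\Big[ Z(\alpha, \rho) \ge 2 \, \mathbb{E}[Z(\alpha, \rho)] + \frac{c_1}{8} \alpha \sqrt{\frac{d \log d}{n}} + \frac{c_2}{4} \alpha^2 \frac{d \log d}{n} \Big] \le e^{-d \log d}. \tag{10.47}$$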


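Bounding the expectation: It remains to bound the expectation. By Rademacher
symmetrization (see Proposition 4.11), we have

$$\mathbb{E}[Z(\alpha, \rho)] \le 2 \, \mathbb{E}\Big[ \sup_{\Theta \in S(\alpha, \rho)} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle\langle X_i, \Theta \rangle\rangle^2 \Big| \Big] \overset{(ii)}{\le} 4 \alpha \, \mathbb{E}\Big[ \sup_{\Theta \in S(\alpha, \rho)} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle\langle X_i, \Theta \rangle\rangle \Big| \Big],$$

where inequality (ii) follows from the Ledoux–Talagrand contraction inequality (5.61) for
Rademacher processes, using the fact that $|\langle\langle \Theta, X_i \rangle\rangle| \le \alpha$ for all pairs $(\Theta, X_i)$. Next we apply
Hölder’s inequality to bound the remaining term: more precisely, since $|||\Theta|||_{nuc} \le \rho$ for any
$\Theta \in S(\alpha, \rho)$, we have

$$\mathbb{E}\Big[ \sup_{\Theta \in S(\alpha, \rho)} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle\langle X_i, \Theta \rangle\rangle \Big| \Big] \le \rho \, \mathbb{E}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i \Big|\Big|\Big|_2 \Big].$$

Finally, note that each matrix $\varepsilon_i X_i$ is zero-mean, has its operator norm upper bounded as
$|||\varepsilon_i X_i|||_2 \le \sqrt{d_1 d_2} \le d$, and its variance bounded as

$$|||\operatorname{var}(\varepsilon_i X_i)|||_2 = \frac{1}{d_1 d_2} \, |||d_1 d_2 \, (\mathbf{1} \otimes \mathbf{1})|||_2 = \sqrt{d_1 d_2}.$$

Consequently, the result of Exercise 6.10 implies that

$$\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i \Big|\Big|\Big|_2 \ge \delta \Big] \le 2 d \exp\Big\{ -\frac{n \delta^2}{2 d (1 + \delta)} \Big\}.$$

Next, applying the result of Exercise 2.8(a) with $C = 2d$, $\nu^2 = \frac{d}{n}$ and $B = \frac{d}{n}$, we find that

$$\mathbb{E}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i \Big|\Big|\Big|_2 \Big] \le 2 \sqrt{\frac{d}{n}} \Big( \sqrt{\log(2d)} + \sqrt{\pi} \Big) + \frac{4 d \log(2d)}{n} \overset{(i)}{\le} 16 \sqrt{\frac{d \log d}{n}}.$$

Here the inequality (i) uses the fact that $n \ge d \log d$. Putting together the pieces, we conclude
that

$$\mathbb{E}[Z(\alpha, \rho)] \le \frac{c_1}{16} \alpha \rho \sqrt{\frac{d \log d}{n}},$$

for an appropriate definition of the universal constant $c_1$. Since $\rho \ge 1$, the claimed bound
(10.46) follows.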

Note that the bound (10.46) involves the fixed quantities $(\alpha, \rho)$, as opposed to the
arbitrary quantities $(\sqrt{d_1 d_2} \|\Theta\|_{\max}, |||\Theta|||_{nuc})$ that would arise in applying the result to an arbitrary
matrix. Extending the bound (10.46) to the more general bound (10.41) requires a technique
known as peeling.


Extension via peeling: Let $\mathbb{B}_F(1)$ denote the Frobenius ball of norm one in $\mathbb{R}^{d_1 \times d_2}$, and let
$\mathcal{E}$ be the event that the bound (10.41) is violated for some $\Theta \in \mathbb{B}_F(1)$. For $k, \ell = 1, 2, \ldots$, let
us define the sets

$$S_{k,\ell} := \big\{ \Theta \in \mathbb{B}_F(1) \;\big|\; 2^{k-1} \le d \|\Theta\|_{\max} \le 2^k \text{ and } 2^{\ell-1} \le |||\Theta|||_{nuc} \le 2^{\ell} \big\},$$

and let $\mathcal{E}_{k,\ell}$ be the event that the bound (10.41) is violated for some $\Theta \in S_{k,\ell}$. We first claim
that

$$\mathcal{E} \subseteq \bigcup_{k,\ell=1}^{M} \mathcal{E}_{k,\ell}, \quad \text{where } M = \lceil \log d \rceil. \tag{10.48}$$

Indeed, for any matrix $\Theta \in S(\alpha, \rho)$, we have

$$|||\Theta|||_{nuc} \ge |||\Theta|||_F = 1 \quad \text{and} \quad |||\Theta|||_{nuc} \le \sqrt{d_1 d_2} \, |||\Theta|||_F \le d.$$

Thus, we may assume that $|||\Theta|||_{nuc} \in [1, d]$ without loss of generality. Similarly, for any
matrix of Frobenius norm one, we must have $d \|\Theta\|_{\max} \ge \sqrt{d_1 d_2} \|\Theta\|_{\max} \ge 1$ and $d \|\Theta\|_{\max} \le d$,
showing that we may also assume that $d \|\Theta\|_{\max} \in [1, d]$. Thus, if there exists a matrix $\Theta$ of
Frobenius norm one that violates the bound (10.41), then it must belong to some set $S_{k,\ell}$ for
$k, \ell = 1, 2, \ldots, M$, with $M = \lceil \log d \rceil$.
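The bookkeeping behind the peeling argument amounts to assigning each quantity in $[1, d]$ to a dyadic shell $[2^{k-1}, 2^k]$. The sketch below computes the shell index for a given value; it assumes that the logarithm in $M = \lceil \log d \rceil$ is taken to base 2 (so that $2^M \ge d$), and the function name is hypothetical.

```python
import math


def shell_index(value, d):
    """Return (k, M) with 2^(k-1) <= value <= 2^k and 1 <= k <= M = ceil(log2 d).

    Assumes value lies in [1, d]; the value 1 is assigned to the first shell.
    """
    if not 1.0 <= value <= d:
        raise ValueError("value must lie in [1, d]")
    m = math.ceil(math.log2(d))
    k = max(1, math.ceil(math.log2(value)))
    return k, m


d = 16
for value in (1.0, 1.5, 5.0, 16.0):
    k, m = shell_index(value, d)
    print(value, k, m, 2 ** (k - 1) <= value <= 2 ** k)
```

Applying this to both $d \|\Theta\|_{\max}$ and $|||\Theta|||_{nuc}$ yields a pair $(k, \ell)$ with each index at most $M$, so only $M^2$ shells need to be handled in the union bound.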

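Next, for $\alpha = 2^k$ and $\rho = 2^\ell$, define the event

$$\tilde{\mathcal{E}}_{k,\ell} := \Big\{ Z(\alpha, \rho) \ge \frac{c_1}{4} \alpha \rho \sqrt{\frac{d \log d}{n}} + \frac{c_2}{4} \Big( \alpha \sqrt{\frac{d \log d}{n}} \Big)^2 \Big\}.$$

We claim that $\mathcal{E}_{k,\ell} \subseteq \tilde{\mathcal{E}}_{k,\ell}$. Indeed, if event $\mathcal{E}_{k,\ell}$ occurs, then there must exist some $\Theta \in S_{k,\ell}$
such that

$$\begin{aligned} \Big| \frac{1}{n} \|\mathfrak{X}_n(\Theta)\|_2^2 - 1 \Big| &\ge c_1 \, d \|\Theta\|_{\max} \, |||\Theta|||_{nuc} \sqrt{\frac{d \log d}{n}} + c_2 \Big( d \|\Theta\|_{\max} \sqrt{\frac{d \log d}{n}} \Big)^2 \\ &\ge c_1 2^{k-1} 2^{\ell-1} \sqrt{\frac{d \log d}{n}} + c_2 \Big( 2^{k-1} \sqrt{\frac{d \log d}{n}} \Big)^2 \\ &\ge \frac{c_1}{4} 2^k 2^\ell \sqrt{\frac{d \log d}{n}} + \frac{c_2}{4} \Big( 2^k \sqrt{\frac{d \log d}{n}} \Big)^2, \end{aligned}$$

showing that $\tilde{\mathcal{E}}_{k,\ell}$ occurs.
Putting together the pieces, we have

$$\mathbb{P}[\mathcal{E}] \overset{(i)}{\le} \sum_{k,\ell=1}^{M} \mathbb{P}[\tilde{\mathcal{E}}_{k,\ell}] \overset{(ii)}{\le} M^2 e^{-d \log d} \overset{(iii)}{\le} e^{-\frac{1}{2} d \log d},$$

where inequality (i) follows from the union bound applied to the inclusion $\mathcal{E} \subseteq \bigcup_{k,\ell=1}^M \tilde{\mathcal{E}}_{k,\ell}$;
inequality (ii) is a consequence of the earlier tail bound (10.46); and inequality (iii) follows
since $\log M^2 = 2 \log \log d \le \frac{1}{2} d \log d$.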


10.7 Additive matrix decompositions 


In this section, we turn to the problem of additive matrix decomposition. Consider a pair of
matrices $\Lambda^*$ and $\Gamma^*$, and suppose that we observe a vector $y \in \mathbb{R}^n$ of the form

$$y = \mathfrak{X}_n(\Lambda^* + \Gamma^*) + w, \tag{10.49}$$

where $\mathfrak{X}_n$ is a known linear observation operator, mapping matrices in $\mathbb{R}^{d_1 \times d_2}$ to a vector in
$\mathbb{R}^n$. In the simplest case, the observation operator performs a simple vectorization, that is,
it maps a matrix $M$ to the vectorized version $\operatorname{vec}(M)$. In this case, the sample size $n$ is equal
to the product $d_1 d_2$ of the dimensions, and we observe noisy versions of the sum $\Lambda^* + \Gamma^*$.
How can we recover the two components based on observations of this form? Of course, this
problem is ill-posed without imposing any structure on the components. One type of
structure that arises in various applications is the combination of a low-rank matrix $\Lambda^*$ with a
sparse matrix $\Gamma^*$. We have already encountered one instance of this type of
decomposition in our discussion of multivariate regression in Example 9.6. The problem of Gaussian
graphical selection with hidden variables, to be discussed at more length in Section 11.4.2,
provides another example of a low-rank and sparse decomposition. Here we consider some
additional examples of such matrix decompositions.


Example 10.19 (Factor analysis with sparse noise) Factor analysis is a natural
generalization of principal component analysis (see Chapter 8 for details on the latter). In factor
analysis, we have i.i.d. random vectors $z_i \in \mathbb{R}^d$ assumed to be generated from the model

$$z_i = L u_i + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, N, \tag{10.50}$$

where $L \in \mathbb{R}^{d \times r}$ is a loading matrix, and the vectors $u_i \sim N(0, I_r)$ and $\varepsilon_i \sim N(0, \Gamma^*)$ are
independent. Given $N$ i.i.d. samples from the model (10.50), the goal is to estimate the loading
matrix $L$, or the matrix $L L^{\mathrm{T}}$ that projects onto the column span of $L$. A simple calculation
shows that the covariance matrix of $z_i$ has the form $\Sigma = L L^{\mathrm{T}} + \Gamma^*$. Consequently, in the
special case when $\Gamma^* = \sigma^2 I_d$, then the range of $L$ is spanned by the top $r$ eigenvectors of $\Sigma$,
and so we can recover it via standard principal components analysis.

In other applications, we might no longer be guaranteed that $\Gamma^*$ is a multiple of the identity, in which
case the top $r$ eigenvectors of $\Sigma$ need not be close to the column span of $L$. Nonetheless,
when $\Gamma^*$ is a sparse matrix, the problem of estimating $L L^{\mathrm{T}}$ can be understood as an instance
of our general observation model (10.3) with $n = d^2$. In particular, letting the observation
vector $y \in \mathbb{R}^n$ be the vectorized version of the sample covariance matrix $\frac{1}{N} \sum_{i=1}^N z_i z_i^{\mathrm{T}}$, then
some algebra shows that $y = \operatorname{vec}(\Lambda^* + \Gamma^*) + \operatorname{vec}(W)$, where $\Lambda^* = L L^{\mathrm{T}}$ is of rank $r$, and the
random matrix $W$ is a Wishart-type noise, viz.

$$W := \frac{1}{N} \sum_{i=1}^N (z_i \otimes z_i) - \big\{ L L^{\mathrm{T}} + \Gamma^* \big\}. \tag{10.51}$$

When $\Gamma^*$ is assumed to be sparse, then this constraint can be enforced via the elementwise
$\ell_1$-norm. ♣


Other examples of matrix decomposition involve the combination of a low-rank matrix with 
a column or row-sparse matrix. 


Example 10.20 (Matrix completion with corruptions) Recommender systems, as previ- 
ously discussed in Example 10.2, are subject to various forms of corruption. For instance, in 
2002, the Amazon recommendation system for books was compromised by a simple attack. 
Adversaries created a large number of false user accounts, amounting to additional rows in 
the matrix of user-book recommendations. These false user accounts were populated with 
strong positive ratings for a spiritual guide and a sex manual. Naturally enough, the end ef- 
fect was that those users who liked the spiritual guide would also be recommended to read 
the sex manual. 

If we again model the unknown true matrix of ratings as being low-rank, then such
adversarial corruptions can be modeled in terms of the addition of a relatively sparse component.
In the case of the false user attack described above, the adversarial component $\Gamma^*$ would be
relatively row-sparse, with the active rows corresponding to the false users. We are then led
to the problem of recovering a low-rank matrix $\Lambda^*$ based on partial observations of the sum
$\Lambda^* + \Gamma^*$. ♣

As discussed in Chapter 6, the problem of covariance estimation is fundamental. A robust
variant of the problem leads to another form of matrix decomposition, as discussed in the
following example:


Example 10.21 (Robust covariance estimation) For i = 1,2,...,N, let u; € R? be sam- 
ples from a zero-mean distribution with unknown covariance matrix A*. When the vectors 
u; are observed without any form of corruption, then it is straightforward to estimate A* by 
performing PCA on the sample covariance matrix. Imagining that j € {1,2,...,d} indexes 
different individuals in the population, now suppose that the data associated with some sub- 
set S of individuals is arbitrarily corrupted. This adversarial corruption can be modeled by 
assuming that we observe the vectors z; = u; + yi fori = 1,...,N, where each y; € R 
is a vector supported on the subset S. Letting E= DIAG ® zi) be the sample covari- 
ance matrix of the corrupted samples, some algebra shows that it can be decomposed as 
E = A* +A + W, where W := x ZX (u; ® u;) — A* is again a type of recentered Wishart 
noise, and the remaining term can be written as 


ix 1 Š 
A := N 2 @ yi) + N 2, (ui 8 yi + yi @ ui). (10.52) 


Thus, defining y = vec®), we have another instance of the general observation model with 
n = d’—namely, y = vec(A* + A) + vec(W). 

Note that $\Delta$ itself is not a column-sparse or row-sparse matrix; however, since each vector
$\gamma_i \in \mathbb{R}^d$ is supported only on some subset $S \subset \{1, 2, \ldots, d\}$, we can write $\Delta = \Gamma^* + (\Gamma^*)^{\mathrm{T}}$,
where $\Gamma^*$ is a column-sparse matrix with entries only in columns indexed by $S$. This structure
can be enforced by the use of the column-sparse regularizer, as discussed in the sequel. ♣
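The decomposition in Example 10.21 can be checked numerically. The following is a minimal sketch, assuming Gaussian clean samples and a hypothetical corrupted index set `S`; the particular column-sparse matrix `Gamma_star` is one possible choice that satisfies the stated identity, not necessarily the one intended in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 6, 200
S = [1, 4]                                  # hypothetical corrupted coordinates

# Clean samples u_i (rows of U) with covariance Lambda_star.
A = rng.standard_normal((d, 2))
Lambda_star = A @ A.T + 0.1 * np.eye(d)
U = rng.multivariate_normal(np.zeros(d), Lambda_star, size=N)

# Corruption vectors gamma_i (rows of G), supported on S only.
G = np.zeros((N, d))
G[:, S] = 5.0 * rng.standard_normal((N, len(S)))
Z = U + G                                   # observed samples z_i = u_i + gamma_i

Sigma_hat = Z.T @ Z / N                     # sample covariance of corrupted data
W = U.T @ U / N - Lambda_star               # recentered Wishart noise
Delta = G.T @ G / N + (U.T @ G + G.T @ U) / N   # equation (10.52)

# One column-sparse matrix with Delta = Gamma_star + Gamma_star^T.
Gamma_star = U.T @ G / N + G.T @ G / (2 * N)
off_S = [j for j in range(d) if j not in S]

print(np.allclose(Sigma_hat, Lambda_star + Delta + W))
print(np.allclose(Delta, Gamma_star + Gamma_star.T))
print(np.allclose(Gamma_star[:, off_S], 0.0))
```

All three checks are algebraic identities, so they hold for any draw of the data, not only for this seed.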


Finally, as we discuss in Chapter 11 to follow, the problem of Gaussian graphical model 
selection with hidden variables also leads to a problem of additive matrix decomposition 
(see Section 11.4.2). 

Having motivated additive matrix decompositions, let us now consider efficient methods 
for recovering them. For concreteness, we focus throughout on the case of low-rank plus 
elementwise-sparse matrices. First, it is important to note that—like the problem of matrix 
completion—we need somehow to exclude matrices that are simultaneously low-rank and 
sparse. Recall the matrix $\Theta^{\mathrm{bad}}$ from Example 10.16: since it is both low-rank and sparse,
it could be decomposed either as a low-rank matrix plus the all-zeros matrix as the sparse 
component, or as a sparse matrix plus the all-zeros matrix as the low-rank component. 

Thus, it is necessary to impose further assumptions on the form of the decomposition. 
One possibility is to impose incoherence conditions (10.38) directly on the singular vectors 
of the low-rank matrix. As noted in Example 10.16, these bounds are not robust to small 
perturbations of this problem. Thus, in the presence of noise, it is more natural to consider 
a bound on the “spikiness” of the low-rank component, which can be enforced by bounding 
the maximum absolute value over its elements. Accordingly, we consider the following es- 
timator: 
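$$(\widehat{\Gamma}, \widehat{\Lambda}) = \arg\min_{\substack{\Gamma \in \mathbb{R}^{d_1 \times d_2} \\ \|\Lambda\|_{\max} \leq \frac{\alpha}{\sqrt{d_1 d_2}}}} \Big\{ \frac{1}{2} |\!|\!| Y - (\Gamma + \Lambda) |\!|\!|_F^2 + \lambda_n \big( \|\Gamma\|_1 + \omega_n |\!|\!| \Lambda |\!|\!|_{\mathrm{nuc}} \big) \Big\}. \qquad (10.53)$$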

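It is parameterized by two regularization parameters, namely $\lambda_n$ and $\omega_n$. The following
corollary provides suitable choices of these parameters that ensure the estimator is well behaved;
the guarantee is stated in terms of the squared Frobenius norm error

$$e^2(\widehat{\Lambda} - \Lambda^*, \widehat{\Gamma} - \Gamma^*) := |\!|\!| \widehat{\Lambda} - \Lambda^* |\!|\!|_F^2 + |\!|\!| \widehat{\Gamma} - \Gamma^* |\!|\!|_F^2. \qquad (10.54)$$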
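To make the structure of the program (10.53) concrete, here is a minimal sketch of an alternating minimization scheme. It is a simplification and not the method analyzed in the text: the max-norm constraint on $\Lambda$ is dropped (an assumption made purely for illustration), so that each block update has a closed form, namely entrywise soft thresholding for $\Gamma$ and singular value soft thresholding for $\Lambda$. The function names, problem sizes and parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise soft thresholding (prox operator of tau * ||.||_1)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Soft thresholding of singular values (prox of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(Y, Gamma, Lam, lam, omega):
    """Objective of (10.53), without the max-norm constraint."""
    fit = 0.5 * np.linalg.norm(Y - Gamma - Lam, "fro") ** 2
    return fit + lam * (np.abs(Gamma).sum() + omega * np.linalg.norm(Lam, "nuc"))

def alternating_min(Y, lam, omega, n_iter=50):
    Gamma, Lam = np.zeros_like(Y), np.zeros_like(Y)
    history = [objective(Y, Gamma, Lam, lam, omega)]
    for _ in range(n_iter):
        Gamma = soft_threshold(Y - Lam, lam)            # exact minimization over Gamma
        Lam = svd_threshold(Y - Gamma, lam * omega)     # exact minimization over Lambda
        history.append(objective(Y, Gamma, Lam, lam, omega))
    return Gamma, Lam, history

rng = np.random.default_rng(0)
d1, d2, r = 20, 15, 2
Lam_true = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2)) / np.sqrt(d1)
Gamma_true = np.zeros((d1, d2))
Gamma_true[rng.integers(0, d1, 10), rng.integers(0, d2, 10)] = 5.0
Y = Lam_true + Gamma_true + 0.01 * rng.standard_normal((d1, d2))

Gamma_hat, Lam_hat, history = alternating_min(Y, lam=0.5, omega=2.0)
```

Since each step minimizes the convex objective exactly over one block, the recorded objective values are non-increasing. Handling the max-norm constraint would require an additional projection or a different solver.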


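Corollary 10.22 Suppose that we solve the convex program (10.53) with parameters

$$\lambda_n \geq 2 \|W\|_{\max} + 4 \frac{\alpha}{\sqrt{d_1 d_2}} \quad \text{and} \quad \omega_n \geq \frac{2 |\!|\!| W |\!|\!|_2}{\lambda_n}. \qquad (10.55)$$

Then there are universal constants $c_j$ such that for any matrix pair $(\Lambda^*, \Gamma^*)$ with
$\|\Lambda^*\|_{\max} \leq \frac{\alpha}{\sqrt{d_1 d_2}}$ and for all integers $r = 1, 2, \ldots, \min\{d_1, d_2\}$ and $s = 1, 2, \ldots, (d_1 d_2)$, the squared
Frobenius error (10.54) is upper bounded as

$$c_1 \omega_n^2 \lambda_n^2 \Big\{ r + \frac{1}{\omega_n \lambda_n} \sum_{j=r+1}^{\min\{d_1, d_2\}} \sigma_j(\Lambda^*) \Big\} + c_2 \lambda_n^2 \Big\{ s + \frac{1}{\lambda_n} \sum_{(j,k) \notin S} |\Gamma^*_{jk}| \Big\}, \qquad (10.56)$$

where $S$ is an arbitrary subset of matrix indices of cardinality at most $s$.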


As with many of our previous results, the bound (10.56) is a form of oracle inequality,
meaning that the choices of target rank $r$ and subset $S$ can be optimized so as to achieve
the tightest possible bound. For instance, when the matrix $\Lambda^*$ is exactly low-rank and $\Gamma^*$ is
sparse, then setting $r = \mathrm{rank}(\Lambda^*)$ and $S = \mathrm{supp}(\Gamma^*)$ yields

$$e^2(\widehat{\Lambda} - \Lambda^*, \widehat{\Gamma} - \Gamma^*) \leq \lambda_n^2 \big\{ c_1 \omega_n^2 \, \mathrm{rank}(\Lambda^*) + c_2 \, |\mathrm{supp}(\Gamma^*)| \big\}.$$

In many cases, this inequality yields optimal results for the Frobenius error of the low-rank
plus sparse problem. We consider a number of examples in the exercises.


Proof We prove this claim as a corollary of Theorem 9.19. Doing so requires three steps: 
(i) verifying a form of restricted strong convexity; (ii) verifying the validity of the regulariza- 
tion parameters; and (iii) computing the subspace Lipschitz constant from Definition 9.18. 


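We begin with restricted strong convexity. Define the two matrices $\Delta_{\widehat{\Gamma}} := \widehat{\Gamma} - \Gamma^*$ and
$\Delta_{\widehat{\Lambda}} := \widehat{\Lambda} - \Lambda^*$, corresponding to the estimation error in the sparse and low-rank components,
respectively. By expanding out the quadratic form, we find that the first-order error in the
Taylor series is given by

$$\mathcal{E}_n(\Delta_{\widehat{\Gamma}}, \Delta_{\widehat{\Lambda}}) = \frac{1}{2} |\!|\!| \Delta_{\widehat{\Gamma}} + \Delta_{\widehat{\Lambda}} |\!|\!|_F^2 = \frac{1}{2} \underbrace{\big\{ |\!|\!| \Delta_{\widehat{\Gamma}} |\!|\!|_F^2 + |\!|\!| \Delta_{\widehat{\Lambda}} |\!|\!|_F^2 \big\}}_{e^2(\Delta_{\widehat{\Lambda}}, \Delta_{\widehat{\Gamma}})} + \langle\!\langle \Delta_{\widehat{\Gamma}}, \Delta_{\widehat{\Lambda}} \rangle\!\rangle.$$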
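By the triangle inequality and the construction of our estimator, we have

$$\|\Delta_{\widehat{\Lambda}}\|_{\max} \leq \|\widehat{\Lambda}\|_{\max} + \|\Lambda^*\|_{\max} \leq \frac{2\alpha}{\sqrt{d_1 d_2}}.$$

Combined with Hölder’s inequality, we see that

$$\mathcal{E}_n(\Delta_{\widehat{\Gamma}}, \Delta_{\widehat{\Lambda}}) \geq \frac{1}{2} e^2(\Delta_{\widehat{\Gamma}}, \Delta_{\widehat{\Lambda}}) - \frac{2\alpha}{\sqrt{d_1 d_2}} \|\Delta_{\widehat{\Gamma}}\|_1,$$

so that restricted strong convexity holds with $\kappa = 1$, but along with an extra error term. Since
it is proportional to $\|\Delta_{\widehat{\Gamma}}\|_1$, the proof of Theorem 9.19 shows that it can be absorbed without
any consequence as long as $\lambda_n \geq \frac{4\alpha}{\sqrt{d_1 d_2}}$.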

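Verifying event $\mathbb{G}(\lambda_n)$: A straightforward calculation gives $\nabla \mathcal{L}_n(\Gamma^*, \Lambda^*) = (W, W)$. From
the dual norm pairs given in Table 9.1, we have

$$\Phi^*_{\omega_n}\big(\nabla \mathcal{L}_n(\Gamma^*, \Lambda^*)\big) = \max\Big\{ \|W\|_{\max}, \; \frac{|\!|\!| W |\!|\!|_2}{\omega_n} \Big\}, \qquad (10.57)$$

so that the choices (10.55) guarantee that $\lambda_n \geq 2 \Phi^*_{\omega_n}(\nabla \mathcal{L}_n(\Gamma^*, \Lambda^*))$.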


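Choice of model subspaces: For any subset $S$ of matrix indices of cardinality at most $s$,
define the subset $\mathbb{M}(S) := \{\Gamma \in \mathbb{R}^{d_1 \times d_2} \mid \Gamma_{ij} = 0 \text{ for all } (i, j) \notin S\}$. Similarly, for any
$r = 1, \ldots, \min\{d_1, d_2\}$, let $U_r$ and $V_r$ be (respectively) the subspaces spanned by the top $r$ left
and right singular vectors of $\Lambda^*$, and recall the subspaces $\overline{\mathbb{M}}(U_r, V_r)$ and $\overline{\mathbb{M}}^{\perp}(U_r, V_r)$ previously
defined in equation (10.12). We are then guaranteed that the regularizer
$\Phi_{\omega_n}(\Gamma, \Lambda) = \|\Gamma\|_1 + \omega_n |\!|\!| \Lambda |\!|\!|_{\mathrm{nuc}}$ is decomposable with respect to the model subspace
$\mathbb{M} := \mathbb{M}(S) \times \overline{\mathbb{M}}(U_r, V_r)$ and deviation space $\mathbb{M}^{\perp}(S) \times \overline{\mathbb{M}}^{\perp}(U_r, V_r)$. It then remains to bound the
subspace Lipschitz constant. We have

$$\Psi(\mathbb{M}) = \sup_{(\Gamma, \Lambda) \in \mathbb{M}(S) \times \overline{\mathbb{M}}(U_r, V_r)} \frac{\|\Gamma\|_1 + \omega_n |\!|\!| \Lambda |\!|\!|_{\mathrm{nuc}}}{\sqrt{|\!|\!| \Gamma |\!|\!|_F^2 + |\!|\!| \Lambda |\!|\!|_F^2}} \leq \sup_{(\Gamma, \Lambda)} \frac{\sqrt{s} \, |\!|\!| \Gamma |\!|\!|_F + \omega_n \sqrt{2r} \, |\!|\!| \Lambda |\!|\!|_F}{\sqrt{|\!|\!| \Gamma |\!|\!|_F^2 + |\!|\!| \Lambda |\!|\!|_F^2}} \leq \sqrt{s} + \omega_n \sqrt{2r}.$$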


Putting together the pieces, the overall claim (10.56) now follows as a corollary of Theo- 
rem 9.19. 
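For readers who want the last step of the proof spelled out, the bound on the subspace Lipschitz constant rests on two applications of the Cauchy-Schwarz inequality. The sketch below assumes, as in Chapter 10, that matrices in the low-rank model subspace have rank at most $2r$; the shorthand $a$ and $b$ is introduced here only for illustration.

```latex
% Shorthand (not from the text): a := |||\Gamma|||_F and b := |||\Lambda|||_F.
\begin{align*}
\|\Gamma\|_1 &\le \sqrt{s}\, a
  && \text{(at most $s$ non-zero entries, Cauchy--Schwarz)} \\
|\!|\!|\Lambda|\!|\!|_{\mathrm{nuc}} &\le \sqrt{2r}\, b
  && \text{(at most $2r$ non-zero singular values, Cauchy--Schwarz)} \\
\frac{\sqrt{s}\, a + \omega_n \sqrt{2r}\, b}{\sqrt{a^2 + b^2}}
  &\le \sqrt{s} + \omega_n \sqrt{2r}
  && \text{(since $a \le \sqrt{a^2 + b^2}$ and $b \le \sqrt{a^2 + b^2}$)}
\end{align*}
```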


In her Ph.D. thesis, Fazel (2002) studied various applications of the nuclear norm as a
surrogate for a rank constraint. Recht et al. (2010) studied the use of nuclear norm regularization
for the compressed sensing variant of matrix regression. They established sufficient
conditions for exact recovery in the noiseless setting (observation model (10.2) with
$w_i = 0$) when the covariates $X_i \in \mathbb{R}^{d_1 \times d_2}$ are drawn from the standard Gaussian ensemble
(each entry of $X_i$ distributed as $N(0, 1)$, drawn independently). In the noisy setting, this
particular ensemble was also studied by Candès and Plan (2010) and Negahban and Wainwright (2011a),
who both gave sharp conditions on the required sample size. The former paper applies to
sub-Gaussian but isotropic ensembles (identity covariance), whereas the latter paper estab- 
lished Theorem 10.8 that applies to Gaussian ensembles with arbitrary covariance matrices. 
Recht et al. (2009) provide precise results on the threshold behavior for the identity version 
of this ensemble. 

Nuclear norm regularization has also been studied for more general problem classes. 
Rohde and Tsybakov (2011) impose a form of the restricted isometry condition (see Chapter 7),
adapted to the matrix setting, whereas Negahban and Wainwright (2011a) work with
a milder lower curvature condition, corresponding to the matrix analog of a restricted
eigenvalue condition in the special case of quadratic losses. Rohde and Tsybakov (2011) also
provide bounds on the nuclear norm estimate in various other Schatten matrix norms. Bounds
for multivariate (or multitask) regression, as in Corollary 10.14, have been proved by various
authors (Lounici et al., 2011; Negahban and Wainwright, 2011a; Rohde and Tsybakov,
2011). The use of reduced rank estimators for multivariate regression has a lengthy his- 
tory; see Exercise 10.1 for its explicit form as well as the references (Izenman, 1975, 2008; 
Reinsel and Velu, 1998) for some history and more details. See also Bunea et al. (2011) for 
non-asymptotic analysis of a class of reduced rank estimators in multivariate regression. 

There is a wide range of variants of the matrix completion problem; see the survey
chapter by Laurent (2001) and references therein for more details. Srebro and his co-authors 
(2004; 2005a; 2005b) proposed low-rank matrix completion as a model for recommender 
systems, among them the Netflix problem described here. Srebro et al. (2005b) provide error 
bounds on the prediction error using nuclear norm regularization. Candès and Recht (2009)
proved exact recovery guarantees for the nuclear norm estimator, assuming noiseless ob- 
servations and certain incoherence conditions on the matrix involving the leverage scores. 
Leverage scores also play an important role in approximating low-rank matrices based on 
random subsamples of its rows or columns; see the survey by Mahoney (2011) and ref- 
erences therein. Gross (2011) provided a general scheme for exact recovery based on a 
dual witness construction, and making use of the Ahlswede–Winter matrix bound from
Section 6.4.4; see also Recht (2011) for a relatively simple argument for exact recovery.
Keshavan et al. (2010a; 2010b) studied both methods based on the nuclear norm (SVD
thresholding) as well as heuristic iterative methods for the matrix completion problem, providing
guarantees in both the noiseless and noisy settings. Negahban and Wainwright (2012) study 
the more general setting of weighted sampling for both exactly low-rank and near-low-rank 
matrices, and provided minimax-optimal bounds for the $\ell_q$-“balls” of matrices with control
on the “spikiness” ratio (10.40). They proved a weighted form of Theorem 10.17; the proof
given here for the uniformly sampled setting is more direct. Koltchinskii et al. (2011) assume
that the sampling design is known, and propose a variant of the matrix Lasso. In the case of 
uniform sampling, it corresponds to a form of SVD thresholding, an estimator that was also 
analyzed by Keshavan et al. (2010a; 2010b). See Exercise 10.11 for some analysis of this 
type of estimator. 

The problem of phase retrieval from Section 10.4 has a lengthy history and various
applications (e.g., Gerchberg and Saxton, 1972; Fienup, 1982; Griffin and Lim, 1984; Fienup and
Wackerman, 1986; Harrison, 1993). The idea of relaxing a non-convex quadratic program to
a semidefinite program is a classical one (Shor, 1987; Lovász and Schrijver, 1991; Nesterov,
1998; Laurent, 2003). The semidefinite relaxation (10.29) for phase retrieval was proposed
by Chai et al. (2011). Candès et al. (2013) provided the first theoretical guarantees on exact
recovery, in particular for Gaussian measurement vectors. See also Waldspurger et al. (2015)
for discussion and analysis of a closely related but different SDP relaxation.

The problem of additive matrix decompositions with sparse and low-rank matrices was 
first formalized by Chandrasekaran et al. (2011), who analyzed conditions for exact recovery
based on deterministic incoherence conditions between the sparse and low-rank components.
Candès et al. (2011) provided related guarantees for random ensembles with milder
incoherence conditions. Chandrasekaran et al. (2012b) showed that the problem of Gaussian 
graphical model selection with hidden variables can be tackled within this framework; see 
Section 11.4.2 of Chapter 11 for more details on this problem. Agarwal et al. (2012) provide 
a general analysis of regularization-based methods for estimating matrix decompositions 
for noisy observations; their work uses the milder bounds on the maximum entry of the 
low-rank matrix, as opposed to incoherence conditions, but guarantees only approximate 
recovery. See Ren and Zhou (2012) for some two-stage approaches for estimating matrix 
decompositions. Fan et al. (2013) study a related class of models for covariance matrices 
involving both sparse and low-rank components. 


10.9 Exercises 


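Exercise 10.1 (Reduced rank regression) Recall the model of multivariate regression from
Example 10.1, and, for a target rank $r \leq T \leq p$, consider the reduced rank regression
estimate

$$\widehat{\Theta}_{\mathrm{RR}} := \arg\min_{\substack{\Theta \in \mathbb{R}^{p \times T} \\ \mathrm{rank}(\Theta) \leq r}} \Big\{ \frac{1}{2n} |\!|\!| Y - Z\Theta |\!|\!|_F^2 \Big\}.$$

Define the sample covariance matrix $\widehat{\Sigma}_{ZZ} = \frac{1}{n} Z^{\mathrm{T}} Z$, and the sample cross-covariance matrix
$\widehat{\Sigma}_{ZY} = \frac{1}{n} Z^{\mathrm{T}} Y$. Assuming that $\widehat{\Sigma}_{ZZ}$ is invertible, show that the reduced rank estimate has the
explicit form

$$\widehat{\Theta}_{\mathrm{RR}} = \widehat{\Sigma}_{ZZ}^{-1} \widehat{\Sigma}_{ZY} V V^{\mathrm{T}},$$

where the matrix $V \in \mathbb{R}^{T \times r}$ has columns consisting of the top $r$ eigenvectors of the matrix
$\widehat{\Sigma}_{YZ} \widehat{\Sigma}_{ZZ}^{-1} \widehat{\Sigma}_{ZY}$.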
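As a numerical sanity check on the closed form in Exercise 10.1 (not a proof), the sketch below computes the reduced rank estimate on synthetic data and compares its loss with that of another rank-$r$ candidate, the truncated SVD of the ordinary least-squares solution. The data-generating choices and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T, r = 200, 8, 5, 2
Z = rng.standard_normal((n, p))
Theta_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, T))
Y = Z @ Theta_true + 0.5 * rng.standard_normal((n, T))

Sigma_ZZ = Z.T @ Z / n
Sigma_ZY = Z.T @ Y / n
Theta_ols = np.linalg.solve(Sigma_ZZ, Sigma_ZY)      # Sigma_ZZ^{-1} Sigma_ZY

# Top-r eigenvectors of Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY (a symmetric T x T matrix).
M = Sigma_ZY.T @ Theta_ols
M = (M + M.T) / 2                                    # remove round-off asymmetry
eigvals, eigvecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
V = eigvecs[:, -r:]
Theta_rr = Theta_ols @ V @ V.T

def loss(Theta):
    """Least-squares objective (1/2n) * ||Y - Z Theta||_F^2."""
    return np.linalg.norm(Y - Z @ Theta, "fro") ** 2 / (2 * n)

# A competing rank-r candidate: truncated SVD of the least-squares solution.
U_, s_, Vt_ = np.linalg.svd(Theta_ols, full_matrices=False)
Theta_trunc = (U_[:, :r] * s_[:r]) @ Vt_[:r]

print(loss(Theta_ols), loss(Theta_rr), loss(Theta_trunc))
```

The unconstrained least-squares loss is the smallest of the three, and the reduced rank estimate should do at least as well as the truncated SVD among rank-$r$ candidates.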


Exercise 10.2 (Vector autoregressive processes) Recall the vector autoregressive (VAR)
model described in Example 10.5.

(a) Suppose that we initialize by choosing $z^1 \sim N(0, \Sigma)$, where the symmetric matrix $\Sigma$
satisfies the equation

$$\Sigma - \Theta^* \Sigma (\Theta^*)^{\mathrm{T}} - \Gamma = 0. \qquad (10.58)$$

Here $\Gamma \succ 0$ is the covariance matrix of the driving noise. Show that the resulting stochastic
process $\{z^t\}_{t=1}^{\infty}$ is stationary.

(b) Suppose that there exists a strictly positive definite solution $\Sigma$ to equation (10.58). Show
that $|\!|\!| \Theta^* |\!|\!|_2 < 1$.
(c) Conversely, supposing that $|\!|\!| \Theta^* |\!|\!|_2 < 1$, show that there exists a strictly positive definite
solution $\Sigma$ to equation (10.58).
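Equation (10.58) is linear in $\Sigma$, so for small dimensions it can be solved directly by vectorization. The sketch below is an illustration under assumed values: a random matrix `Theta` rescaled to have operator norm 0.8, and a hypothetical positive definite noise covariance `Gamma`.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
Theta = rng.standard_normal((d, d))
Theta *= 0.8 / np.linalg.norm(Theta, 2)      # rescale so the operator norm is 0.8 < 1
B = rng.standard_normal((d, d))
Gamma = B @ B.T + np.eye(d)                  # positive definite noise covariance

# With row-major vectorization, vec(Theta Sigma Theta^T) = (Theta kron Theta) vec(Sigma),
# so (10.58) becomes the linear system (I - Theta kron Theta) vec(Sigma) = vec(Gamma).
lhs = np.eye(d * d) - np.kron(Theta, Theta)
Sigma = np.linalg.solve(lhs, Gamma.reshape(-1)).reshape(d, d)

residual = Sigma - Theta @ Sigma @ Theta.T - Gamma
print(np.abs(residual).max())                # close to zero
print(np.linalg.eigvalsh((Sigma + Sigma.T) / 2).min())   # strictly positive
```

If $z^t \sim N(0, \Sigma)$, then $z^{t+1} = \Theta^* z^t + w^t$ has covariance $\Theta^* \Sigma (\Theta^*)^{\mathrm{T}} + \Gamma$, so a vanishing residual is exactly the statement that the covariance is preserved from one step to the next.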


Exercise 10.3 (Nullspace in matrix completion) Consider the random observation operator
$\mathfrak{X}_n : \mathbb{R}^{d \times d} \to \mathbb{R}^n$ formed by $n$ i.i.d. draws of rescaled mask matrices (zero everywhere
except for $d$ in an entry chosen uniformly at random). For the “bad” matrix $\Theta^{\mathrm{bad}}$ from
equation (10.37), show that $\mathbb{P}[\mathfrak{X}_n(\Theta^{\mathrm{bad}}) = 0] = 1 - o(1)$ whenever $n = o(d^2)$.


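Exercise 10.4 (Cone inequalities for nuclear norm) Suppose that $|\!|\!| \widehat{\Theta} |\!|\!|_{\mathrm{nuc}} \leq |\!|\!| \Theta^* |\!|\!|_{\mathrm{nuc}}$, where
$\Theta^*$ is a rank-$r$ matrix. Show that $\widehat{\Delta} = \widehat{\Theta} - \Theta^*$ satisfies the cone constraint $|\!|\!| \widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}} |\!|\!|_{\mathrm{nuc}} \leq |\!|\!| \widehat{\Delta}_{\overline{\mathbb{M}}} |\!|\!|_{\mathrm{nuc}}$,
where the subspace $\overline{\mathbb{M}}^{\perp}$ was defined in equation (10.14).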


Exercise 10.5 (Operator norm bounds) 


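(a) Verify the specific form (10.20) of the $\Phi^*$-curvature condition.
(b) Assume that $\Theta^*$ has rank $r$, and that $\widehat{\Theta} - \Theta^*$ satisfies the cone constraint (10.15), where
$\mathbb{M}(U, V)$ is specified by subspaces $U$ and $V$ of dimension $r$. Show that

$$|\!|\!| \widehat{\Theta} - \Theta^* |\!|\!|_{\mathrm{nuc}} \leq 4 \sqrt{2r} \, |\!|\!| \widehat{\Theta} - \Theta^* |\!|\!|_F.$$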


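Exercise 10.6 (Analysis of matrix compressed sensing) In this exercise, we work through
part of the proof of Theorem 10.8 for the special case $\Sigma = I_D$, where $D = d_1 d_2$. In particular,
defining the set

$$\mathbb{B}(t) := \big\{ \Delta \in \mathbb{R}^{d_1 \times d_2} \mid |\!|\!| \Delta |\!|\!|_F = 1, \; |\!|\!| \Delta |\!|\!|_{\mathrm{nuc}} \leq t \big\},$$

for some $t > 0$, we show that

$$\inf_{\Delta \in \mathbb{B}(t)} \sqrt{\frac{1}{n} \sum_{i=1}^n \langle\!\langle X_i, \Delta \rangle\!\rangle^2} \geq \frac{1}{2} - \delta - 2 \Big( \sqrt{\frac{d_1}{n}} + \sqrt{\frac{d_2}{n}} \Big) t$$

with probability greater than $1 - e^{-n\delta^2/2}$. (This is a weaker result than Theorem 10.8, but the
argument sketched here illustrates the essential ideas.)

(a) Reduce the problem to lower bounding the random variable

$$Z_n(t) := \inf_{\Delta \in \mathbb{B}(t)} \sup_{\|u\|_2 = 1} \frac{1}{\sqrt{n}} \sum_{i=1}^n u_i \langle\!\langle X_i, \Delta \rangle\!\rangle.$$

(b) Show that the expectation can be lower bounded as

$$\mathbb{E}[Z_n(t)] \geq \frac{1}{\sqrt{n}} \big\{ \mathbb{E}[\|w\|_2] - \mathbb{E}[|\!|\!| W |\!|\!|_2] \, t \big\},$$

where $w \in \mathbb{R}^n$ and $W \in \mathbb{R}^{d_1 \times d_2}$ are populated with i.i.d. $N(0, 1)$ variables. (Hint: The
Gordon-Slepian comparison principle from Chapter 5 could be useful here.)
(c) Complete the proof using concentration of measure and part (b).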


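Exercise 10.7 (Bounds for approximately low-rank matrices) Consider the observation
model $y = \mathfrak{X}_n(\Theta^*) + w$ with $w \sim N(0, \sigma^2 I_n)$, and consider the nuclear norm constrained
estimator

$$\widehat{\Theta} = \arg\min_{\Theta \in \mathbb{R}^{d \times d}} \Big\{ \frac{1}{2n} \|y - \mathfrak{X}_n(\Theta)\|_2^2 \Big\} \quad \text{subject to } |\!|\!| \Theta |\!|\!|_{\mathrm{nuc}} \leq |\!|\!| \Theta^* |\!|\!|_{\mathrm{nuc}}.$$

Suppose that $\Theta^*$ belongs to the $\ell_q$-“ball” of near-low-rank matrices (10.26).
In this exercise, we show that the estimate $\widehat{\Theta}$ satisfies an error bound of the form (10.27)
when the random operator $\mathfrak{X}_n$ satisfies the lower bound of Theorem 10.8.

(a) For an arbitrary $r \in \{1, 2, \ldots, d\}$, let $U$ and $V$ be subspaces defined by the top $r$ left and
right singular vectors of $\Theta^*$, and consider the subspace $\overline{\mathbb{M}}(U, V)$. Prove that the error
matrix $\widehat{\Delta}$ satisfies the inequality

$$|\!|\!| \widehat{\Delta}_{\overline{\mathbb{M}}^{\perp}} |\!|\!|_{\mathrm{nuc}} \leq 2 \sqrt{2r} \, |\!|\!| \widehat{\Delta} |\!|\!|_F + 2 \sum_{j=r+1}^{d} \sigma_j(\Theta^*).$$

(b) Consider an integer $r \in \{1, \ldots, d\}$ such that $n > C r d$ for some sufficiently large but
universal constant $C$. Using Theorem 10.8 and part (a), show that

$$|\!|\!| \widehat{\Delta} |\!|\!|_F^2 \lesssim \underbrace{\max\{T_1(r), T_1^2(r)\}}_{\text{approximation error}} + \underbrace{\sigma \sqrt{\frac{rd}{n}} \, |\!|\!| \widehat{\Delta} |\!|\!|_F}_{\text{estimation error}},$$

where $T_1(r) := \sigma \sqrt{\frac{d}{n}} \sum_{j=r+1}^{d} \sigma_j(\Theta^*)$. (Hint: You may assume that an inequality of the
form $|\!|\!| \frac{1}{n} \sum_{i=1}^n w_i X_i |\!|\!|_2 \lesssim \sigma \sqrt{\frac{d}{n}}$ holds.)
(c) Specify a choice of $r$ that trades off the estimation and approximation error optimally.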


Exercise 10.8 Under the assumptions of Corollary 10.14, prove that the bound (10.35) 
holds. 


Exercise 10.9 (Phase retrieval with Gaussian masks) Recall the real-valued phase retrieval
problem, based on the functions $f_\Theta(X) = \langle\!\langle X, \Theta \rangle\!\rangle$, for a random matrix $X = x \otimes x$ with
x ~ N(O,I,). 


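(a) Letting $\Theta = U^{\mathrm{T}} D U$ denote the singular value decomposition of $\Theta$, explain why the
random variables $f_\Theta(X)$ and $f_D(X)$ have the same distributions.
(b) Prove that

$$\mathbb{E}\big[f_\Theta^2(X)\big] = 2 \, |\!|\!| \Theta |\!|\!|_F^2 + \big(\mathrm{trace}(\Theta)\big)^2.$$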


Exercise 10.10 (Analysis of noisy matrix completion) In this exercise, we work through 
the proof of Corollary 10.18. 


(a) Argue that with the setting A, > II È; wiE;ll2, we are guaranteed that the error matrix 
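$\widehat{\Delta} = \widehat{\Theta} - \Theta^*$ satisfies the bounds

$$\frac{|\!|\!| \widehat{\Delta} |\!|\!|_{\mathrm{nuc}}}{|\!|\!| \widehat{\Delta} |\!|\!|_F} \leq 2 \sqrt{2r} \quad \text{and} \quad \|\widehat{\Delta}\|_{\max} \leq 2\alpha.$$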
(b) Use part (a) and results from the chapter to show that, with high probability, at least one
of the following inequalities must hold: 


A I2 
Il|Alll: < 


LAIL NAIR 
c 2dlogd $ ping ed s |X n(ADII5 . Il lle 
2 n n n 4 
(c) Use part (b) to establish the bound.


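Exercise 10.11 (Alternative estimator for matrix completion) Consider the problem of
noisy matrix completion, based on observations $y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + w_i$, where $X_i \in \mathbb{R}^{d \times d}$ is
a $d$-rescaled mask matrix (i.e., with a single entry of $d$ in one location chosen uniformly at
random, and zeros elsewhere). Consider the estimator

$$\widehat{\Theta} = \arg\min_{\Theta \in \mathbb{R}^{d \times d}} \Big\{ \frac{1}{2} |\!|\!| \Theta |\!|\!|_F^2 - \Big\langle\!\!\Big\langle \Theta, \frac{1}{n} \sum_{i=1}^n y_i X_i \Big\rangle\!\!\Big\rangle + \lambda_n |\!|\!| \Theta |\!|\!|_{\mathrm{nuc}} \Big\}.$$

(a) Show that the optimal solution $\widehat{\Theta}$ is unique, and can be obtained by soft thresholding the
singular values of the matrix $M := \frac{1}{n} \sum_{i=1}^n y_i X_i$. In particular, if $U D V^{\mathrm{T}}$ denotes the SVD
of $M$, then $\widehat{\Theta} = U [T_{\lambda_n}(D)] V^{\mathrm{T}}$, where $T_{\lambda_n}(D)$ is the matrix formed by soft thresholding
the diagonal matrix of singular values $D$.
(b) Suppose that the unknown matrix $\Theta^*$ has rank $r$. Show that, with the choice

$$\lambda_n \geq 2 \max_{|\!|\!| U |\!|\!|_{\mathrm{nuc}} \leq 1} \Big| \frac{1}{n} \sum_{i=1}^n \langle\!\langle U, X_i \rangle\!\rangle \langle\!\langle X_i, \Theta^* \rangle\!\rangle - \langle\!\langle U, \Theta^* \rangle\!\rangle \Big| + 2 \, \Big|\!\Big|\!\Big| \frac{1}{n} \sum_{i=1}^n w_i X_i \Big|\!\Big|\!\Big|_2,$$

the optimal solution $\widehat{\Theta}$ satisfies the bound

$$|\!|\!| \widehat{\Theta} - \Theta^* |\!|\!|_F \leq \frac{3}{\sqrt{2}} \sqrt{r} \, \lambda_n.$$

(c) Suppose that the noise vector $w \in \mathbb{R}^n$ has i.i.d. $\sigma$-sub-Gaussian entries. Specify an
appropriate choice of $\lambda_n$ that yields a useful bound on $|\!|\!| \widehat{\Theta} - \Theta^* |\!|\!|_F$.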
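The estimator in Exercise 10.11 is simple to compute. The following minimal sketch forms the matrix $M$ from simulated rescaled-mask observations, applies soft thresholding to its singular values, and compares the objective value at the result against nearby candidates. The problem sizes, noise level and the value `lam` are hypothetical and not tuned according to part (b).

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, n = 10, 2, 400
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d)) / d

rows = rng.integers(0, d, size=n)
cols = rng.integers(0, d, size=n)
# X_i has the single entry d at (rows[i], cols[i]), so <<X_i, Theta*>> = d * Theta*[row, col].
y = d * Theta_star[rows, cols] + 0.1 * rng.standard_normal(n)

# M = (1/n) * sum_i y_i X_i, accumulated entry by entry.
M = np.zeros((d, d))
np.add.at(M, (rows, cols), d * y / n)

lam = 0.5                                       # hypothetical regularization level
U, s, Vt = np.linalg.svd(M)
Theta_hat = (U * np.maximum(s - lam, 0.0)) @ Vt   # soft threshold the singular values

def objective(Theta):
    """(1/2)||Theta||_F^2 - <<Theta, M>> + lam * ||Theta||_nuc."""
    return (0.5 * np.linalg.norm(Theta, "fro") ** 2
            - np.sum(Theta * M)
            + lam * np.linalg.norm(Theta, "nuc"))

print(objective(Theta_hat), objective(np.zeros((d, d))), objective(M))
```

Because the objective is strictly convex, no other matrix should attain a smaller value than `Theta_hat`; comparing against a few perturbed candidates is a cheap way to catch implementation mistakes.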


11 


Graphical models for high-dimensional data 


Graphical models are based on a combination of ideas from both probability theory and 
graph theory, and are useful in modeling high-dimensional probability distributions. They 
have been developed and studied in a variety of fields, including statistical physics, spatial 
statistics, information and coding theory, speech processing, statistical image processing, 
computer vision, natural language processing, computational biology and social network 
analysis among others. In this chapter, we discuss various problems in high-dimensional 
statistics that arise in the context of graphical models. 


11.1 Some basics 


We begin with a brief introduction to some basic properties of graphical models, referring 
the reader to the bibliographic section for additional references. There are various types of 
graphical models, distinguished by the type of underlying graph used—directed, undirected, 
or a hybrid of the two. Here we focus exclusively on the case of undirected graphical models,
also known as Markov random fields. These models are based on an undirected graph $G = (V, E)$,
which consists of a set of vertices $V = \{1, 2, \ldots, d\}$ joined together by a collection
of edges $E$. In the undirected case, an edge $(j, k)$ is an unordered pair of distinct vertices
$j, k \in V$.

In order to introduce a probabilistic aspect to our models, we associate to each vertex
$j \in V$ a random variable $X_j$, taking values in some space $\mathcal{X}_j$. We then consider the
distribution $\mathbb{P}$ of the $d$-dimensional random vector $X = (X_1, \ldots, X_d)$. Of primary interest to us
are connections between the structure of $\mathbb{P}$, and the structure of the underlying graph $G$.
There are two ways in which to connect the probabilistic and graphical structures: one based 
on factorization, and the second based on conditional independence properties. A classi- 
cal result in the field, known as the Hammersley—Clifford theorem, asserts that these two 
characterizations are essentially equivalent. 


11.1.1 Factorization 


One way to connect the undirected graph G to the random variables is by enforcing a certain 
factorization of the probability distribution. A clique C is a subset of vertices that are all 
joined by edges, meaning that (j,k) € E for all distinct vertices j,k € C. A maximal clique is 
a clique that is not a subset of any other clique. See Figure 11.1(a) for an illustration of these
concepts. We use $\mathfrak{C}$ to denote the set of all cliques in $G$, and for each clique $C \in \mathfrak{C}$, we use
$\psi_C$ to denote a function of the subvector $x_C := (x_j, \, j \in C)$. This clique compatibility function
takes inputs from the Cartesian product space $\mathcal{X}^C := \bigotimes_{j \in C} \mathcal{X}_j$ and returns non-negative real
numbers. With this notation, we have the following:


Definition 11.1 The random vector $(X_1, \ldots, X_d)$ factorizes according to the graph $G$
if its density function $p$ can be represented as

$$p(x_1, \ldots, x_d) \propto \prod_{C \in \mathfrak{C}} \psi_C(x_C) \qquad (11.1)$$

for some collection of clique compatibility functions $\psi_C : \mathcal{X}^C \to [0, \infty)$.


Here the density function is taken with respect either to the counting measure for discrete- 
valued random variables, or to some (possibly weighted) version of the Lebesgue measure 
for continuous random variables. As an illustration of Definition 11.1, any density that fac- 
torizes according to the graph shown in Figure 11.1(a) must have the form 


$$p(x_1, \ldots, x_7) \propto \psi_{123}(x_1, x_2, x_3) \, \psi_{345}(x_3, x_4, x_5) \, \psi_{46}(x_4, x_6) \, \psi_{57}(x_5, x_7).$$


Figure 11.1 Illustration of basic graph-theoretic properties. (a) Subsets A and B are 
3-cliques, whereas subsets C and D are 2-cliques. All of these cliques are maximal. 
Each vertex is a clique as well, but none of these singleton cliques are maximal for 
this graph. (b) Subset S is a vertex cutset, breaking the graph into two disconnected 
subgraphs with vertex sets A and B, respectively. 


Without loss of generality—redefining the clique compatibility functions as necessary— 
the product over cliques can always be restricted to the set of all maximal cliques. However, 
in practice, it can be convenient to allow for terms associated with non-maximal cliques as 
well, as illustrated by the following. 


Example 11.2 (Markov chain factorization) The standard way of factoring the distribution
of a Markov chain on variables $(X_1, \ldots, X_d)$ is as

$$p(x_1, \ldots, x_d) = p_1(x_1) \, p_{2|1}(x_2 \mid x_1) \cdots p_{d|(d-1)}(x_d \mid x_{d-1}),$$

where $p_1$ denotes the marginal distribution of $X_1$, and for $j \in \{1, 2, \ldots, d-1\}$, the term $p_{j+1|j}$
denotes the conditional distribution of $X_{j+1}$ given $X_j$. This representation can be understood
as a special case of the factorization (11.1), using the vertex-based functions

$$\psi_1(x_1) = p_1(x_1) \text{ at vertex } 1 \quad \text{and} \quad \psi_j(x_j) = 1 \text{ for all } j = 2, \ldots, d,$$

combined with the edge-based functions

$$\psi_{j,j+1}(x_j, x_{j+1}) = p_{j+1|j}(x_{j+1} \mid x_j) \quad \text{for } j = 1, \ldots, d-1.$$

But this factorization is by no means unique. We could just as easily adopt the symmetrized
factorization $\widetilde{\psi}_j(x_j) = p_j(x_j)$ for all $j = 1, \ldots, d$, and

$$\widetilde{\psi}_{jk}(x_j, x_k) = \frac{p_{jk}(x_j, x_k)}{p_j(x_j) \, p_k(x_k)} \quad \text{for all } (j, k) \in E,$$

where $p_{jk}$ denotes the joint distribution over the pair $(X_j, X_k)$. ♣
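The equivalence of the two factorizations in Example 11.2 can be verified by direct enumeration on a small chain. The sketch below assumes a randomly generated chain with `d = 4` variables and `k = 3` states per variable; these sizes and all variable names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
d, k = 4, 3                                       # chain length, number of states
p1 = rng.dirichlet(np.ones(k))                    # marginal distribution of X_1
# P[j][a, b] = probability that the (j+2)-th variable is b given the (j+1)-th is a.
P = rng.dirichlet(np.ones(k), size=(d - 1, k))

# Node marginals and pairwise joints along the chain.
marg = [p1]
for j in range(d - 1):
    marg.append(marg[j] @ P[j])
pair = [marg[j][:, None] * P[j] for j in range(d - 1)]

def standard(x):
    """p_1(x_1) * prod_j p_{j+1|j}(x_{j+1} | x_j)."""
    val = p1[x[0]]
    for j in range(d - 1):
        val *= P[j][x[j], x[j + 1]]
    return val

def symmetrized(x):
    """prod_j p_j(x_j) * prod_{(j,k) in E} p_jk / (p_j * p_k)."""
    val = np.prod([marg[j][x[j]] for j in range(d)])
    for j in range(d - 1):
        val *= pair[j][x[j], x[j + 1]] / (marg[j][x[j]] * marg[j + 1][x[j + 1]])
    return val

configs = list(itertools.product(range(k), repeat=d))
max_gap = max(abs(standard(x) - symmetrized(x)) for x in configs)
total = sum(standard(x) for x in configs)
print(max_gap, total)
```

The two products agree on every configuration up to round-off, and both sum to one.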


Example 11.3 (Multivariate Gaussian factorization) Any non-degenerate Gaussian
distribution with zero mean can be parameterized in terms of its inverse covariance matrix
$\Theta^* = \Sigma^{-1}$, also known as the precision matrix. In particular, its density can be written as

$$p(x_1, \ldots, x_d; \Theta^*) = \frac{\sqrt{\det(\Theta^*)}}{(2\pi)^{d/2}} \, e^{-\frac{1}{2} x^{\mathrm{T}} \Theta^* x}.$$

By expanding the quadratic form, we see that

$$e^{-\frac{1}{2} x^{\mathrm{T}} \Theta^* x} = \exp\Big( -\frac{1}{2} \sum_{(j,k) \in E} \Theta^*_{jk} x_j x_k \Big) = \prod_{(j,k) \in E} \underbrace{e^{-\frac{1}{2} \Theta^*_{jk} x_j x_k}}_{\psi_{jk}(x_j, x_k)},$$

showing that any zero-mean Gaussian distribution can be factorized in terms of functions on
edges, or cliques of size two. The Gaussian case is thus special: the factorization can always
be restricted to cliques of size two, even if the underlying graph has higher-order cliques. ♣


We now turn to a non-Gaussian graphical model that shares a similar factorization: 


Example 11.4 (Ising model) Consider a vector $X = (X_1, \ldots, X_d)$ of binary random variables,
with each $X_j \in \{0, 1\}$. The Ising model is one of the earliest graphical models, first
introduced in the context of statistical physics for modeling interactions in a magnetic field.
Given an undirected graph $G = (V, E)$, it posits a factorization of the form

$$p(x_1, \ldots, x_d; \theta^*) = \frac{1}{Z(\theta^*)} \exp\Big\{ \sum_{j \in V} \theta^*_j x_j + \sum_{(j,k) \in E} \theta^*_{jk} x_j x_k \Big\}, \qquad (11.3)$$

where the parameter $\theta^*_j$ is associated with vertex $j \in V$, and the parameter $\theta^*_{jk}$ is associated
with edge $(j, k) \in E$. The quantity $Z(\theta^*)$ is a constant that serves to enforce that the
probability mass function $p$ normalizes properly to one; more precisely, we have

$$Z(\theta^*) = \sum_{x \in \{0,1\}^d} \exp\Big\{ \sum_{j \in V} \theta^*_j x_j + \sum_{(j,k) \in E} \theta^*_{jk} x_j x_k \Big\}.$$

See the bibliographic section for further discussion of the history and uses of this model. ♣
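For very small graphs, the normalization constant of the Ising model (11.3) can be computed by brute-force enumeration over all $2^d$ configurations. The sketch below uses a hypothetical three-vertex chain with made-up parameter values and 0-based vertex labels; the enumeration is exponential in $d$ and is meant only as an illustration.

```python
import itertools
import math

vertices = [0, 1, 2]
edges = [(0, 1), (1, 2)]                       # a 3-vertex chain
theta_vertex = {0: 0.5, 1: -0.3, 2: 0.2}       # hypothetical vertex parameters
theta_edge = {(0, 1): 1.0, (1, 2): -0.7}       # hypothetical edge parameters

def exponent(x):
    """Sum of vertex terms theta_j * x_j and edge terms theta_jk * x_j * x_k."""
    return (sum(theta_vertex[j] * x[j] for j in vertices)
            + sum(theta_edge[(j, k)] * x[j] * x[k] for (j, k) in edges))

configs = list(itertools.product([0, 1], repeat=len(vertices)))
Z = sum(math.exp(exponent(x)) for x in configs)          # partition function
pmf = {x: math.exp(exponent(x)) / Z for x in configs}    # equation (11.3)

print(Z, sum(pmf.values()))
```

Dividing by `Z` is exactly what makes the probabilities sum to one, which is the role the text assigns to $Z(\theta^*)$.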
11.1.2 Conditional independence


We now turn to an alternative way in which to connect the probabilistic and graphical struc- 
tures, involving certain conditional independence statements defined by the graph. These 
statements are based on the notion of a vertex cutset S, which (loosely stated) is a subset 
of vertices whose removal from the graph breaks it into two or more disjoint pieces. More 
formally, removing S from the vertex set V leads to the vertex-induced subgraph G(V \ S), 
consisting of the vertex set V \ S, and the residual edge set 


$$E(V \setminus S) := \{ (j, k) \in E \mid j, k \in V \setminus S \}. \qquad (11.4)$$


The set S is a vertex cutset if the residual graph G(V\S ) consists of two or more disconnected 
non-empty components. See Figure 11.1(b) for an illustration. 

We now define a conditional independence relationship associated with each vertex cutset 
of the graph. For any subset $A \subseteq V$, let $X_A := (X_j, \, j \in A)$ represent the subvector of random
variables indexed by vertices in $A$. For any three disjoint subsets, say $A$, $B$ and $S$, of the vertex
set $V$, we use $X_A \perp\!\!\!\perp X_B \mid X_S$ to mean that the subvector $X_A$ is conditionally independent
of $X_B$ given $X_S$.


Definition 11.5 A random vector $X = (X_1, \ldots, X_d)$ is Markov with respect to a graph
$G$ if, for all vertex cutsets $S$ breaking the graph into disjoint pieces $A$ and $B$, the conditional
independence statement $X_A \perp\!\!\!\perp X_B \mid X_S$ holds.


Let us consider some examples to illustrate. 


Example 11.6 (Markov chain conditional independence) The Markov chain provides the
simplest (and most classical) illustration of this definition. A chain graph on vertex set
$V = \{1, 2, \ldots, d\}$ contains the edges $(j, j+1)$ for $j = 1, 2, \ldots, d-1$; the case $d = 5$ is
illustrated in Figure 11.2(a). For such a chain graph, each vertex $j \in \{2, 3, \ldots, d-1\}$ is
a non-trivial cutset, breaking the graph into the “past” $P = \{1, 2, \ldots, j-1\}$ and “future”
$F = \{j+1, \ldots, d\}$. These singleton cutsets define the essential Markov property of a Markov
time-series model, namely that the past $X_P$ and future $X_F$ are conditionally independent
given the present $X_j$. ♣


Example 11.7 (Neighborhood-based cutsets) Another canonical type of vertex cutset is
provided by the neighborhood structure of the graph. For any vertex $j \in V$, its neighborhood
set is the subset of vertices

$$N(j) := \{ k \in V \mid (j, k) \in E \} \qquad (11.5)$$

that are joined to $j$ by an edge. It is easy to see that $N(j)$ is always a vertex cutset, a non-trivial
one as long as $j$ is not connected to every other vertex; it separates the graph into
the two disjoint components $A = \{j\}$ and $B = V \setminus (N(j) \cup \{j\})$. This particular choice of
vertex cutset plays an important role in our discussion of neighborhood-based methods for
graphical model selection later in the chapter. ♣
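The neighborhood cutset of Example 11.7 is easy to compute explicitly. The sketch below uses a seven-vertex graph whose edge list is inferred from the clique factorization given for Figure 11.1(a) (cliques {1,2,3}, {3,4,5}, {4,6}, {5,7}); since the figure itself is not reproduced here, this edge list should be read as an assumption.

```python
from collections import deque

# Edge list assumed from the cliques {1,2,3}, {3,4,5}, {4,6}, {5,7}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (4, 6), (5, 7)]
V = set(range(1, 8))

def neighbors(j, edge_list):
    """Neighborhood set N(j): all vertices joined to j by an edge."""
    return {b if a == j else a for a, b in edge_list if j in (a, b)}

def component(start, vertices, edge_list):
    """Vertices reachable from `start` in the subgraph induced by `vertices`."""
    kept = [(a, b) for a, b in edge_list if a in vertices and b in vertices]
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in neighbors(u, kept) - seen:
            seen.add(w)
            queue.append(w)
    return seen

j = 4
N_j = neighbors(j, edges)             # the cutset S = N(j)
residual = V - N_j                    # vertices left after removing the cutset
A = component(j, residual, edges)     # component containing j
B = residual - A                      # everything else, V \ (N(j) U {j})
print(N_j, A, B)
```

For vertex 4 in this graph, removing its neighborhood leaves vertex 4 isolated from the remaining vertices, in line with the statement that $N(j)$ separates $\{j\}$ from $V \setminus (N(j) \cup \{j\})$.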
11.1.3 Hammersley–Clifford equivalence


Thus far, we have introduced two (ostensibly distinct) ways of relating the random vector X 
to the underlying graph structure, namely the Markov property and the factorization prop- 
erty. We now turn to a fundamental theorem that establishes that these two properties are 
equivalent for any strictly positive distribution: 


Theorem 11.8 (Hammersley–Clifford) For a given undirected graph and any random
vector $X = (X_1, \ldots, X_d)$ with strictly positive density $p$, the following two properties are
equivalent:


(a) The random vector $X$ factorizes according to the structure of the graph $G$, as in
Definition 11.1.
(b) The random vector $X$ is Markov with respect to the graph $G$, as in Definition 11.5.


Proof Here we show that the factorization property (Definition 11.1) implies the Markov
property (Definition 11.5). See the bibliographic section for references to proofs of the
converse. Suppose that the factorization (11.1) holds, and let $S$ be an arbitrary vertex cutset of
the graph such that subsets $A$ and $B$ are separated by $S$. We may assume without loss of
generality that both $A$ and $B$ are non-empty, and we need to show that $X_A \perp\!\!\!\perp X_B \mid X_S$. Let
us define subsets of cliques by $\mathfrak{C}_A := \{C \in \mathfrak{C} \mid C \cap A \neq \emptyset\}$, $\mathfrak{C}_B := \{C \in \mathfrak{C} \mid C \cap B \neq \emptyset\}$
and $\mathfrak{C}_S := \{C \in \mathfrak{C} \mid C \subseteq S\}$. We claim that these three subsets form a disjoint partition of
the full clique set, namely $\mathfrak{C} = \mathfrak{C}_A \cup \mathfrak{C}_S \cup \mathfrak{C}_B$. Given any clique $C$, it is either contained
entirely within $S$, or must have non-trivial intersection with either $A$ or $B$, which proves the
union property. To establish disjointedness, it is immediate that $\mathfrak{C}_S$ is disjoint from $\mathfrak{C}_A$ and
$\mathfrak{C}_B$. On the other hand, if there were some clique $C \in \mathfrak{C}_A \cap \mathfrak{C}_B$, then there would exist nodes
$a \in A$ and $b \in B$ with $\{a, b\} \subseteq C$, which contradicts the fact that $A$ and $B$ are separated by the
cutset $S$.
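Given this disjoint partition, we may write

$$p(x_A, x_S, x_B) = \frac{1}{Z} \underbrace{\Big[ \prod_{C \in \mathfrak{C}_A} \psi_C(x_C) \Big]}_{\Psi_A(x_A, x_S)} \underbrace{\Big[ \prod_{C \in \mathfrak{C}_S} \psi_C(x_C) \Big]}_{\Psi_S(x_S)} \underbrace{\Big[ \prod_{C \in \mathfrak{C}_B} \psi_C(x_C) \Big]}_{\Psi_B(x_B, x_S)}.$$

Defining the quantities

$$Z_A(x_S) := \sum_{x_A} \Psi_A(x_A, x_S) \quad \text{and} \quad Z_B(x_S) := \sum_{x_B} \Psi_B(x_B, x_S),$$

we then obtain the following expressions for the marginal distributions of interest:

$$p(x_S) = \frac{Z_A(x_S) \, Z_B(x_S)}{Z} \, \Psi_S(x_S) \quad \text{and} \quad p(x_A, x_S) = \frac{Z_B(x_S)}{Z} \, \Psi_A(x_A, x_S) \, \Psi_S(x_S),$$

with a similar expression for $p(x_B, x_S)$. Consequently, for any $x_S$ for which $p(x_S) > 0$, we
may write

$$\frac{p(x_A, x_S, x_B)}{p(x_S)} = \frac{\frac{1}{Z} \Psi_A(x_A, x_S) \, \Psi_S(x_S) \, \Psi_B(x_B, x_S)}{\frac{Z_A(x_S) Z_B(x_S)}{Z} \, \Psi_S(x_S)} = \frac{\Psi_A(x_A, x_S) \, \Psi_B(x_B, x_S)}{Z_A(x_S) \, Z_B(x_S)}. \qquad (11.6)$$

Similar calculations yield the relations

$$\frac{p(x_A, x_S)}{p(x_S)} = \frac{\frac{Z_B(x_S)}{Z} \Psi_A(x_A, x_S) \, \Psi_S(x_S)}{\frac{Z_A(x_S) Z_B(x_S)}{Z} \, \Psi_S(x_S)} = \frac{\Psi_A(x_A, x_S)}{Z_A(x_S)} \qquad (11.7a)$$

and

$$\frac{p(x_B, x_S)}{p(x_S)} = \frac{\frac{Z_A(x_S)}{Z} \Psi_B(x_B, x_S) \, \Psi_S(x_S)}{\frac{Z_A(x_S) Z_B(x_S)}{Z} \, \Psi_S(x_S)} = \frac{\Psi_B(x_B, x_S)}{Z_B(x_S)}. \qquad (11.7b)$$

Combining equation (11.6) with equations (11.7a) and (11.7b) yields

$$p(x_A, x_B \mid x_S) = \frac{p(x_A, x_B, x_S)}{p(x_S)} = \frac{p(x_A, x_S)}{p(x_S)} \, \frac{p(x_B, x_S)}{p(x_S)} = p(x_A \mid x_S) \, p(x_B \mid x_S),$$

thereby showing that $X_A \perp\!\!\!\perp X_B \mid X_S$, as claimed.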
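The conclusion of this argument can be checked numerically on the smallest non-trivial case: a three-vertex chain with cutset $S = \{2\}$, $A = \{1\}$ and $B = \{3\}$. The sketch below assumes binary variables and randomly drawn, strictly positive edge compatibility functions; the array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
# Strictly positive compatibility functions on the edges (1,2) and (2,3).
psi_12 = rng.uniform(0.5, 2.0, size=(2, 2))
psi_23 = rng.uniform(0.5, 2.0, size=(2, 2))

# Joint pmf p[x1, x2, x3] proportional to psi_12(x1, x2) * psi_23(x2, x3).
p = psi_12[:, :, None] * psi_23[None, :, :]
p /= p.sum()

p_S = p.sum(axis=(0, 2))            # marginal of X_2 (the cutset)
p_AS = p.sum(axis=2)                # joint of (X_1, X_2), indexed [x1, x2]
p_SB = p.sum(axis=0)                # joint of (X_2, X_3), indexed [x2, x3]

# Left side: p(x1, x3 | x2).  Right side: p(x1 | x2) * p(x3 | x2).
lhs = p / p_S[None, :, None]
rhs = (p_AS / p_S[None, :])[:, :, None] * (p_SB / p_S[:, None])[None, :, :]
print(np.abs(lhs - rhs).max())
```

The two sides agree up to round-off for any strictly positive choice of the edge functions, as the factorization argument predicts.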


11.1.4 Estimation of graphical models 


Typical applications of graphical models require solving some sort of inverse problem of the
following type. Consider a collection of samples $\{x_i\}_{i=1}^n$, where each $x_i = (x_{i1}, \ldots, x_{id})$ is a
$d$-dimensional vector, hypothesized to have been drawn from some graph-structured probability
distribution. The goal is to estimate certain aspects of the underlying graphical model.
In the problem of graphical parameter estimation, the graph structure itself is assumed to
be known, and we want to estimate the compatibility functions $\{\psi_C, \, C \in \mathfrak{C}\}$ on the graph
cliques. In the more challenging problem of graphical model selection, the graph structure
itself is unknown, so that we need to estimate both it and the clique compatibility functions.
In the following sections, we consider various methods for solving these problems for both
Gaussian and non-Gaussian models.


11.2 Estimation of Gaussian graphical models 


We begin our exploration of graph estimation for the case of Gaussian Markov random fields. 
As previously discussed in Example 11.3, for a Gaussian model, the factorization property is 
specified by the inverse covariance or precision matrix ©*. Consequently, the Hammersley— 
Clifford theorem is especially easy to interpret in this case: it ensures that Oj, = 0 for any 
(j,k) ¢ E. See Figure 11.2 for some illustrations of this correspondence between graph 
structure and the sparsity of the inverse covariance matrix. 
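The correspondence between missing edges and zeros of the precision matrix can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `chain_precision` and the numerical values (unit diagonal, off-diagonal entries of -0.4) are illustrative choices, not taken from the text. It builds a tridiagonal precision matrix for a chain on five vertices and inverts it, showing that the covariance matrix is dense even though the precision matrix is sparse.

```python
import numpy as np

def chain_precision(d, off=-0.4, diag=1.0):
    """Tridiagonal precision matrix for a Gauss-Markov chain on d vertices.

    The numerical values are hypothetical; any positive definite
    tridiagonal matrix would illustrate the same point.
    """
    theta = diag * np.eye(d)
    for j in range(d - 1):
        theta[j, j + 1] = theta[j + 1, j] = off
    return theta

theta_star = chain_precision(5)
sigma_star = np.linalg.inv(theta_star)  # covariance matrix

# Edges of the graph are the non-zero off-diagonal entries of the precision matrix.
edges = {(j, k) for j in range(5) for k in range(j + 1, 5)
         if theta_star[j, k] != 0.0}
print(sorted(edges))
print(np.round(sigma_star, 3))  # no zero entries: sparsity lives in the inverse
```

The zero pattern is therefore a property of the inverse covariance, which is why the estimators in this section target $\Theta^*$ rather than the covariance itself.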

Now let us consider some estimation problems that arise for Gaussian Markov random 
fields. Since the mean is easily estimated, we take it to be zero for the remainder of our 
development. Thus, the only remaining parameter is the precision matrix $\Theta^*$. Given an
estimate $\widehat{\Theta}$ of $\Theta^*$, its quality can be assessed in different ways. In the problem of graphical
model selection, also known as (inverse) covariance selection, the goal is to recover the 
Figure 11.2 For Gaussian graphical models, the Hammersley–Clifford theorem
guarantees a correspondence between the graph structure and the sparsity pattern
of the inverse covariance matrix or precision matrix $\Theta^*$. (a) Chain graph on five
vertices. (b) Inverse covariance for a Gauss–Markov chain must have a tridiagonal
structure. (c), (d) More general Gauss–Markov random field and the associated
inverse covariance matrix.


edge set E of the underlying graph G. More concretely, letting $\widehat{E}$ denote an estimate of the
edge set based on $\widehat{\Theta}$, one figure of merit is the error probability $\mathbb{P}[\widehat{E} \neq E]$, which assesses
whether or not we have recovered the true underlying edge set. A related but more relaxed
criterion would focus on the probability of recovering a fraction $1 - \delta$ of the edge set, where
$\delta \in (0,1)$ is a user-specified tolerance parameter. In other settings, we might be interested
in estimating the inverse covariance matrix itself, and so consider various types of matrix
norms, such as the operator norm $|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_2$ or the Frobenius norm $|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_F$. In the
following sections, we consider these different choices of metrics in more detail. 
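These figures of merit are straightforward to compute once an estimate is available. The sketch below assumes NumPy; the helper names `edge_set` and `figures_of_merit`, and the tolerance `tol` used to decide which entries count as non-zero, are hypothetical choices made for illustration.

```python
import numpy as np

def edge_set(theta, tol=1e-10):
    """Pairs (j, k) with j < k whose off-diagonal entry is non-zero (up to tol)."""
    d = theta.shape[0]
    return {(j, k) for j in range(d) for k in range(j + 1, d)
            if abs(theta[j, k]) > tol}

def figures_of_merit(theta_hat, theta_star, tol=1e-10):
    """Edge-recovery and matrix-norm errors for a precision matrix estimate."""
    e_hat, e_star = edge_set(theta_hat, tol), edge_set(theta_star, tol)
    diff = theta_hat - theta_star
    return {
        "exact_recovery": e_hat == e_star,
        "fraction_recovered": len(e_hat & e_star) / max(len(e_star), 1),
        "operator_norm": np.linalg.norm(diff, 2),
        "frobenius_norm": np.linalg.norm(diff, "fro"),
    }

theta_star = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
theta_hat = theta_star + 0.1 * np.eye(3)  # perturb only the diagonal
merit = figures_of_merit(theta_hat, theta_star)
print(merit)
```

In this toy case the edge set is recovered exactly, while the two norms still register the diagonal perturbation, which illustrates that selection and estimation criteria measure different things.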


11.2.1 Graphical Lasso: $\ell_1$-regularized maximum likelihood


We begin with a natural and direct method for estimating a Gaussian graphical model, 
namely one based on the global likelihood. In order to do so, let us first derive a convenient 
form of the rescaled negative log-likelihood, one that involves the log-determinant function. 
For any two symmetric matrices A and B, recall that we use $\langle\!\langle A, B \rangle\!\rangle := \operatorname{trace}(AB)$ to denote
the trace inner product. The negative log-determinant function is defined on the space $\mathcal{S}^{d \times d}$
of symmetric matrices as
\[
-\log\det(\Theta) :=
\begin{cases}
-\sum_{j=1}^{d} \log \gamma_j(\Theta) & \text{if } \Theta \succ 0, \\
+\infty & \text{otherwise,}
\end{cases}
\tag{11.8}
\]
where $\gamma_1(\Theta) \geq \gamma_2(\Theta) \geq \cdots \geq \gamma_d(\Theta)$ denote the ordered eigenvalues of the symmetric
matrix $\Theta$. In Exercise 11.1, we explore some basic properties of the log-determinant function,
including its strict convexity and differentiability. 
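Definition (11.8) translates directly into code. The following minimal sketch assumes NumPy and uses the hypothetical function name `neg_log_det`; it evaluates the extended-value function through the eigenvalues, returning infinity outside the positive definite cone.

```python
import numpy as np

def neg_log_det(theta):
    """Negative log-determinant (11.8) of a symmetric matrix.

    Returns +inf when theta is not strictly positive definite.
    """
    eigvals = np.linalg.eigvalsh(theta)  # eigenvalues of a symmetric matrix
    if eigvals.min() <= 0.0:
        return np.inf
    return -np.sum(np.log(eigvals))

print(neg_log_det(np.eye(3)))
print(neg_log_det(np.diag([1.0, -1.0])))
```

Working with eigenvalues rather than the determinant itself mirrors the definition and avoids overflow or underflow of the determinant in higher dimensions.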

Using the parameterization (11.2) of the Gaussian distribution in terms of the precision
matrix, the rescaled negative log-likelihood of the multivariate Gaussian, based on samples
$\{x_i\}_{i=1}^n$, takes the form
\[
\mathcal{L}_n(\Theta) = \langle\!\langle \Theta, \widehat{\Sigma} \rangle\!\rangle - \log\det(\Theta), \tag{11.9}
\]
where $\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^n x_i x_i^T$ is the sample covariance matrix. Here we have dropped some
constant factors in the log-likelihood that have no effect on the maximum likelihood solution,
and also rescaled the log-likelihood by $-2/n$ for later theoretical convenience.
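As a brief check of the form (11.9), the following derivation assumes the standard zero-mean Gaussian density with precision matrix $\Theta$; the exact constants in (11.2) are not reproduced here, so the additive constant below is an assumption.

```latex
\begin{aligned}
\log p_\Theta(x_i) &= -\tfrac{1}{2}\, x_i^T \Theta x_i + \tfrac{1}{2} \log\det(\Theta) - \tfrac{d}{2} \log(2\pi), \\
-\frac{2}{n} \sum_{i=1}^n \log p_\Theta(x_i)
  &= \frac{1}{n} \sum_{i=1}^n x_i^T \Theta x_i - \log\det(\Theta) + d \log(2\pi) \\
  &= \Big\langle\!\Big\langle \Theta,\ \frac{1}{n} \sum_{i=1}^n x_i x_i^T \Big\rangle\!\Big\rangle - \log\det(\Theta) + d \log(2\pi).
\end{aligned}
```

The last step uses $x^T \Theta x = \operatorname{trace}(\Theta\, x x^T)$. Dropping the constant $d \log(2\pi)$, which does not depend on $\Theta$, leaves the trace term minus the log-determinant.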


The unrestricted maximum likelihood solution $\widehat{\Theta}_{\mathrm{MLE}}$ takes a very simple form for the
Gaussian model. If the sample covariance matrix $\widehat{\Sigma}$ is invertible, we have $\widehat{\Theta}_{\mathrm{MLE}} = \widehat{\Sigma}^{-1}$;
otherwise, the maximum likelihood solution is undefined (see Exercise 11.2 for more
details). Whenever $n < d$, the sample covariance matrix is always rank-deficient, so that the
maximum likelihood estimate does not exist. In this setting, some form of regularization is
essential. When the graph G is expected to have relatively few edges, a natural form of
regularization is to impose an $\ell_1$-constraint on the entries of $\Theta$. (If computational considerations
were not a concern, it would be natural to impose an $\ell_0$-constraint, but as in Chapter 7, we use
the $\ell_1$-norm as a convex surrogate.)

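Combining $\ell_1$-regularization with the negative log-likelihood yields the graphical Lasso
estimator
\[
\widehat{\Theta} \in \arg\min_{\Theta \in \mathcal{S}^{d \times d}}
\Big\{ \underbrace{\langle\!\langle \Theta, \widehat{\Sigma} \rangle\!\rangle - \log\det \Theta}_{\mathcal{L}_n(\Theta)} + \lambda_n |\!|\!|\Theta|\!|\!|_{1,\mathrm{off}} \Big\}, \tag{11.10}
\]
where $|\!|\!|\Theta|\!|\!|_{1,\mathrm{off}} := \sum_{j \neq k} |\Theta_{jk}|$ corresponds to the $\ell_1$-norm applied to the off-diagonal entries
of $\Theta$. One could also imagine penalizing the diagonal entries of $\Theta$, but since they must be
positive for any non-degenerate inverse covariance, doing so only introduces additional bias.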
The convex program (11.10) is a particular instance of a log-determinant program, and can 
be solved in polynomial time with various generic algorithms. Moreover, there is also a line 
of research on efficient methods specifically tailored to the graphical Lasso problem; see the 
bibliographic section for further discussion. 
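To make the optimization problem (11.10) concrete, here is a minimal sketch of one generic way to solve it: proximal gradient descent with backtracking. This is not one of the tailored algorithms alluded to above, and the function names, starting point, iteration count and tolerances are all illustrative assumptions. It assumes NumPy, uses the gradient $\widehat{\Sigma} - \Theta^{-1}$ of the smooth part, and applies soft-thresholding to the off-diagonal entries only.

```python
import numpy as np

def smooth_loss(theta, sigma_hat):
    """Smooth part <<Theta, Sigma_hat>> - log det(Theta); +inf off the PD cone."""
    eigvals = np.linalg.eigvalsh(theta)
    if eigvals.min() <= 0.0:
        return np.inf
    return np.trace(theta @ sigma_hat) - np.sum(np.log(eigvals))

def prox_off_l1(theta, tau):
    """Soft-threshold the off-diagonal entries by tau; leave the diagonal alone."""
    out = np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)
    np.fill_diagonal(out, np.diag(theta))
    return out

def graphical_lasso(sigma_hat, lam, n_iter=300):
    """Proximal gradient sketch for the graphical Lasso objective (11.10)."""
    # Positive definite diagonal starting point (a hypothetical choice).
    theta = np.diag(1.0 / (np.diag(sigma_hat) + lam))
    for _ in range(n_iter):
        grad = sigma_hat - np.linalg.inv(theta)
        f_old = smooth_loss(theta, sigma_hat)
        step = 1.0
        while True:
            cand = prox_off_l1(theta - step * grad, step * lam)
            delta = cand - theta
            # Standard sufficient-decrease test; it also rejects non-PD candidates,
            # for which smooth_loss returns +inf.
            bound = f_old + np.sum(grad * delta) + np.sum(delta ** 2) / (2 * step)
            if smooth_loss(cand, sigma_hat) <= bound + 1e-12:
                break
            step *= 0.5
        theta = cand
    return theta

sigma_hat = np.array([[1.0, 0.3], [0.3, 2.0]])
theta_hat = graphical_lasso(sigma_hat, lam=0.5)
print(np.round(theta_hat, 4))
```

In this two-dimensional example the regularization parameter exceeds the off-diagonal sample covariance, so the zero-subgradient conditions are satisfied by a diagonal matrix with entries $1/\widehat{\Sigma}_{jj}$, and the iterates should approach it. For practical use, a dedicated solver is preferable to this sketch.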


Frobenius norm bounds 


We begin our investigation of the graphical Lasso (11.10) by deriving bounds on the
Frobenius norm error $|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_F$. The following result is based on a sample covariance matrix $\widehat{\Sigma}$
formed from n i.i.d. samples $\{x_i\}_{i=1}^n$ of a zero-mean random vector in which each coordinate
has $\sigma$-sub-Gaussian tails (recall Definition 2.2 from Chapter 2).

Proposition 11.9 (Frobenius norm bounds for graphical Lasso) Suppose that the
inverse covariance matrix $\Theta^*$ has at most m non-zero entries per row, and we solve the
graphical Lasso (11.10) with regularization parameter $\lambda_n = 8\sigma^2\big(\sqrt{\tfrac{\log d}{n}} + \delta\big)$ for some
$\delta \in (0, 1]$. Then as long as $6(|\!|\!|\Theta^*|\!|\!|_2 + 1)^2 \lambda_n \sqrt{md} < 1$, the graphical Lasso estimate $\widehat{\Theta}$
satisfies
\[
|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_F^2 \leq \frac{9}{(|\!|\!|\Theta^*|\!|\!|_2 + 1)^4}\, m\, d\, \lambda_n^2 \tag{11.11}
\]
with probability at least $1 - 8e^{-\frac{1}{16} n \delta^2}$.


Proof We prove this result by applying Corollary 9.20 from Chapter 9. In order to do so, 
we need to verify the restricted strong convexity of the loss function (see Definition 9.15), 
as well as other technical conditions given in the corollary. 

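Let $\mathbb{B}_F(1) = \{\Delta \in \mathcal{S}^{d \times d} \mid |\!|\!|\Delta|\!|\!|_F \leq 1\}$ denote the set of symmetric matrices with
Frobenius norm at most one. Using standard properties of the log-determinant function (see
Exercise 11.1), the loss function underlying the graphical Lasso is twice differentiable, with
\[
\nabla \mathcal{L}_n(\Theta) = \widehat{\Sigma} - \Theta^{-1} \quad \text{and} \quad \nabla^2 \mathcal{L}_n(\Theta) = \Theta^{-1} \otimes \Theta^{-1},
\]
where $\otimes$ denotes the Kronecker product between matrices.

Verifying restricted strong convexity: Our first step is to establish that restricted strong
convexity holds over the Frobenius norm ball $\mathbb{B}_F(1)$. Let $\operatorname{vec}(\cdot)$ denote the vectorized form
of a matrix. For any $\Delta \in \mathbb{B}_F(1)$, a Taylor-series expansion yields
\[
\underbrace{\mathcal{L}_n(\Theta^* + \Delta) - \mathcal{L}_n(\Theta^*) - \langle\!\langle \nabla \mathcal{L}_n(\Theta^*), \Delta \rangle\!\rangle}_{\mathcal{E}_n(\Delta)}
= \frac{1}{2} \operatorname{vec}(\Delta)^T\, \nabla^2 \mathcal{L}_n(\Theta^* + t\Delta)\, \operatorname{vec}(\Delta)
\]
for some $t \in [0, 1]$. Thus, we have
\[
\mathcal{E}_n(\Delta) \geq \frac{1}{2}\, \gamma_{\min}\big(\nabla^2 \mathcal{L}_n(\Theta^* + t\Delta)\big)\, \|\operatorname{vec}(\Delta)\|_2^2
= \frac{1}{2}\, \frac{|\!|\!|\Delta|\!|\!|_F^2}{|\!|\!|\Theta^* + t\Delta|\!|\!|_2^2},
\]
using the fact that $|\!|\!|A^{-1} \otimes A^{-1}|\!|\!|_2 = \frac{1}{|\!|\!|A|\!|\!|_2^2}$ for any symmetric invertible matrix. The triangle
inequality, in conjunction with the bound $t|\!|\!|\Delta|\!|\!|_2 \leq t|\!|\!|\Delta|\!|\!|_F \leq 1$, implies that
$|\!|\!|\Theta^* + t\Delta|\!|\!|_2^2 \leq (|\!|\!|\Theta^*|\!|\!|_2 + 1)^2$. Combining the pieces yields the lower bound
\[
\mathcal{E}_n(\Delta) \geq \frac{\kappa}{2}\, |\!|\!|\Delta|\!|\!|_F^2 \quad \text{where } \kappa := (|\!|\!|\Theta^*|\!|\!|_2 + 1)^{-2}, \tag{11.12}
\]
showing that the RSC condition from Definition 9.15 holds over $\mathbb{B}_F(1)$ with tolerance $\tau_n^2 = 0$.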


Computing the subspace Lipschitz constant: Next we introduce a subspace suitable for
application of Corollary 9.20 to the graphical Lasso. Letting S denote the support set of $\Theta^*$,
we define the subspace
\[
\mathbb{M}(S) = \{\Theta \in \mathbb{R}^{d \times d} \mid \Theta_{jk} = 0 \text{ for all } (j,k) \notin S\}.
\]
With this choice, we have
\[
\Psi^2(\mathbb{M}(S)) = \sup_{\Theta \in \mathbb{M}(S)} \frac{\big(\sum_{j \neq k} |\Theta_{jk}|\big)^2}{|\!|\!|\Theta|\!|\!|_F^2} \leq |S| \overset{(i)}{\leq} md,
\]
where inequality (i) follows since $\Theta^*$ has at most m non-zero entries per row.


Verifying event $\mathbb{G}(\lambda_n)$: Next we verify that the stated choice of regularization parameter $\lambda_n$
satisfies the conditions of Corollary 9.20 with high probability: in order to do so, we need to
compute the score function and obtain a bound on its dual norm. Since $(\Theta^*)^{-1} = \Sigma$, the score
function is given by $\nabla \mathcal{L}_n(\Theta^*) = \widehat{\Sigma} - \Sigma$, corresponding to the deviations between the sample
covariance and population covariance matrices. The dual norm defined by $|\!|\!|\cdot|\!|\!|_{1,\mathrm{off}}$ is given
by the $\ell_\infty$-norm applied to the off-diagonal matrix entries, which we denote by $|\!|\!|\cdot|\!|\!|_{\max,\mathrm{off}}$.
Using Lemma 6.26, we have
\[
\mathbb{P}\big[\, |\!|\!|\widehat{\Sigma} - \Sigma|\!|\!|_{\max,\mathrm{off}} \geq \sigma^2 t \,\big] \leq 8 e^{-\frac{n}{16} \min\{t,\, t^2\} + 2 \log d} \quad \text{for all } t > 0.
\]
Setting $t = \lambda_n/\sigma^2$ shows that the event $\mathbb{G}(\lambda_n)$ from Corollary 9.20 holds with the claimed
probability. Consequently, Proposition 9.13 implies that the error matrix $\widehat{\Delta}$ satisfies the
bound $\|\widehat{\Delta}_{S^c}\|_1 \leq 3\|\widehat{\Delta}_S\|_1$, and hence
\[
\|\widehat{\Delta}\|_1 \leq 4\|\widehat{\Delta}_S\|_1 \leq 4\sqrt{md}\, |\!|\!|\widehat{\Delta}|\!|\!|_F, \tag{11.13}
\]
where the final inequality again uses the fact that $|S| \leq md$. In order to apply Corollary 9.20,
the only remaining detail to verify is that $\widehat{\Delta}$ belongs to the Frobenius ball $\mathbb{B}_F(1)$.


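Localizing the error matrix: By an argument parallel to the earlier proof of RSC, we have
\[
\mathcal{L}_n(\Theta^*) - \mathcal{L}_n(\Theta^* + \Delta) - \langle\!\langle \nabla \mathcal{L}_n(\Theta^* + \Delta), -\Delta \rangle\!\rangle \geq \frac{\kappa}{2}\, |\!|\!|\Delta|\!|\!|_F^2.
\]
Adding this lower bound to the inequality (11.12), we find that
\[
\langle\!\langle \nabla \mathcal{L}_n(\Theta^* + \Delta) - \nabla \mathcal{L}_n(\Theta^*), \Delta \rangle\!\rangle \geq \kappa\, |\!|\!|\Delta|\!|\!|_F^2.
\]
The result of Exercise 9.10 then implies that
\[
\langle\!\langle \nabla \mathcal{L}_n(\Theta^* + \Delta) - \nabla \mathcal{L}_n(\Theta^*), \Delta \rangle\!\rangle \geq \kappa\, |\!|\!|\Delta|\!|\!|_F \quad \text{for all } \Delta \in \mathcal{S}^{d \times d} \setminus \mathbb{B}_F(1). \tag{11.14}
\]
By the optimality of $\widehat{\Theta}$, we have $0 = \langle\!\langle \nabla \mathcal{L}_n(\Theta^* + \widehat{\Delta}) + \lambda_n \widehat{Z}, \widehat{\Delta} \rangle\!\rangle$, where $\widehat{Z} \in \partial |\!|\!|\widehat{\Theta}|\!|\!|_{1,\mathrm{off}}$ is a
subgradient matrix for the elementwise $\ell_1$-norm. By adding and subtracting terms, we find
that
\[
\begin{aligned}
\langle\!\langle \nabla \mathcal{L}_n(\Theta^* + \widehat{\Delta}) - \nabla \mathcal{L}_n(\Theta^*), \widehat{\Delta} \rangle\!\rangle
&\leq \lambda_n \big|\langle\!\langle \widehat{Z}, \widehat{\Delta} \rangle\!\rangle\big| + \big|\langle\!\langle \nabla \mathcal{L}_n(\Theta^*), \widehat{\Delta} \rangle\!\rangle\big| \\
&\leq \big\{\lambda_n + \|\nabla \mathcal{L}_n(\Theta^*)\|_{\max}\big\}\, \|\widehat{\Delta}\|_1.
\end{aligned}
\]
Since $\|\nabla \mathcal{L}_n(\Theta^*)\|_{\max} \leq \frac{\lambda_n}{2}$ under the previously established event $\mathbb{G}(\lambda_n)$, the right-hand side
is at most
\[
\frac{3\lambda_n}{2}\, \|\widehat{\Delta}\|_1 \leq 6 \lambda_n \sqrt{md}\, |\!|\!|\widehat{\Delta}|\!|\!|_F,
\]
where we have applied our earlier inequality (11.13). If $|\!|\!|\widehat{\Delta}|\!|\!|_F > 1$, then our earlier lower
bound (11.14) may be applied, from which we obtain
\[
\kappa\, |\!|\!|\widehat{\Delta}|\!|\!|_F \leq \frac{3\lambda_n}{2}\, \|\widehat{\Delta}\|_1 \leq 6 \lambda_n \sqrt{md}\, |\!|\!|\widehat{\Delta}|\!|\!|_F.
\]
This inequality leads to a contradiction whenever $\frac{6 \lambda_n \sqrt{md}}{\kappa} < 1$, which completes the proof.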


Edge selection and operator norm bounds 


Proposition 11.9 is a relatively crude result, in that it only guarantees that the graphical Lasso
estimate $\widehat{\Theta}$ is close in Frobenius norm, but not that the edge structure of the underlying
graph is preserved. Moreover, the result actually precludes the setting $n < d$: indeed, the
conditions of Proposition 11.9 imply that the sample size n must be lower bounded by a
constant multiple of $md \log d$, which is larger than d.

Accordingly, we now turn to a more refined type of result, namely one that allows for
high-dimensional scaling ($d \gg n$), and moreover guarantees that the graphical Lasso
estimate $\widehat{\Theta}$ correctly selects all the edges of the graph. Such an edge selection result can be
guaranteed by first proving that $\widehat{\Theta}$ is close to the true precision matrix $\Theta^*$ in the
elementwise $\ell_\infty$-norm on the matrix elements (denoted by $\|\cdot\|_{\max}$). In turn, such max-norm control
can also be converted to bounds on the $\ell_2$-matrix operator norm, also known as the spectral
norm.

The problem of edge selection in a Gaussian graphical model is closely related to the
problem of variable selection in a sparse linear model. As previously discussed in
Chapter 7, variable selection with an $\ell_1$-norm penalty requires a certain type of incoherence
condition, which limits the influence of irrelevant variables on relevant ones. In the case
of least-squares regression, these incoherence conditions were imposed on the design
matrix, or equivalently on the Hessian of the least-squares objective function. Accordingly, in
a parallel manner, here we impose incoherence conditions on the Hessian of the objective
function $\mathcal{L}_n$ in the graphical Lasso (11.10). As previously noted, this Hessian takes the form
$\nabla^2 \mathcal{L}_n(\Theta) = \Theta^{-1} \otimes \Theta^{-1}$, a $d^2 \times d^2$ matrix that is indexed by ordered pairs of vertices $(j, k)$.

More specifically, the incoherence condition must be satisfied by the $d^2$-dimensional
matrix $\Gamma^* := \nabla^2 \mathcal{L}_n(\Theta^*)$, corresponding to the Hessian evaluated at the true precision matrix. We
use $S := E \cup \{(j, j) \mid j \in V\}$ to denote the set of row/column indices associated with edges
in the graph (including both $(j,k)$ and $(k, j)$), along with all the self-edges $(j, j)$. Letting
$S^c = (V \times V) \setminus S$, we say that the matrix $\Gamma^*$ is $\alpha$-incoherent if
\[
\max_{e \in S^c} \big\|\Gamma^*_{eS} (\Gamma^*_{SS})^{-1}\big\|_1 \leq 1 - \alpha \quad \text{for some } \alpha \in (0, 1]. \tag{11.15}
\]


With this definition, we have the following result: 
Proposition 11.10 Consider a zero-mean d-dimensional Gaussian distribution based
on an $\alpha$-incoherent inverse covariance matrix $\Theta^*$. Given a sample size lower bounded
as $n > c_0 (1 + 8\alpha^{-1})^2 m^2 \log d$, suppose that we solve the graphical Lasso (11.10) with a
regularization parameter $\lambda_n = \frac{c_1}{\alpha} \sqrt{\frac{\log d}{n}} + \delta$ for some $\delta \in (0, 1]$. Then with probability
at least $1 - c_2 e^{-c_3 n \delta^2}$, we have the following:

(a) The graphical Lasso solution leads to no false inclusions, that is, $\widehat{\Theta}_{jk} = 0$ for all
$(j,k) \notin E$.
(b) It satisfies the sup-norm bound
\[
\|\widehat{\Theta} - \Theta^*\|_{\max} \leq c_4 \Big\{ \underbrace{(1 + 8\alpha^{-1}) \sqrt{\frac{\log d}{n}}}_{\tau(n,d,\alpha)} + \lambda_n \Big\}. \tag{11.16}
\]


Note that part (a) guarantees that the edge set estimate
\[
\widehat{E} := \{(j,k) \in [d] \times [d] \mid j < k \text{ and } \widehat{\Theta}_{jk} \neq 0\}
\]
is always a subset of the true edge set E. Part (b) guarantees that $\widehat{\Theta}$ is uniformly close to
$\Theta^*$ in an elementwise sense. Consequently, if we have a lower bound on the minimum
non-zero entry of $|\Theta^*|$, namely the quantity $\tau^*(\Theta^*) = \min_{(j,k) \in E} |\Theta^*_{jk}|$, then we can guarantee that
the graphical Lasso recovers the full edge set correctly. In particular, using the notation of
part (b), as long as this minimum is lower bounded as $\tau^*(\Theta^*) > c_4(\tau(n, d, \alpha) + \lambda_n)$, then the
graphical Lasso recovers the correct edge set with high probability.

The proof of Proposition 11.10 is based on an extension of the primal–dual witness
technique used to prove Theorem 7.21 in Chapter 7. In particular, it involves constructing a pair
of matrices $(\widehat{\Theta}, \widehat{Z})$, where $\widehat{\Theta} \succ 0$ is a primal optimal solution and $\widehat{Z}$ a corresponding dual
optimum. This pair of matrices is required to satisfy the zero subgradient conditions that
define the optimum of the graphical Lasso (11.10), namely
\[
\widehat{\Sigma} - \widehat{\Theta}^{-1} + \lambda_n \widehat{Z} = 0 \quad \text{or equivalently} \quad \widehat{\Theta}^{-1} = \widehat{\Sigma} + \lambda_n \widehat{Z}.
\]
The matrix $\widehat{Z}$ must belong to the subgradient of the $|\!|\!|\cdot|\!|\!|_{1,\mathrm{off}}$ function, evaluated at $\widehat{\Theta}$, meaning
that $|\!|\!|\widehat{Z}|\!|\!|_{\max,\mathrm{off}} \leq 1$, and that $\widehat{Z}_{jk} = \operatorname{sign}(\widehat{\Theta}_{jk})$ whenever $\widehat{\Theta}_{jk} \neq 0$. We refer the reader to the
bibliographic section for further details and references for the proof.
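The zero subgradient conditions can be verified numerically for a candidate solution. The sketch below assumes NumPy; the function name `satisfies_glasso_kkt` and the tolerance `tol` are hypothetical. It solves the stationarity equation for $\widehat{Z}$ and then checks the subgradient requirements on the off-diagonal entries, together with a zero diagonal since the diagonal is not penalized.

```python
import numpy as np

def satisfies_glasso_kkt(theta_hat, sigma_hat, lam, tol=1e-6):
    """Check the zero-subgradient conditions of the graphical Lasso at theta_hat."""
    d = theta_hat.shape[0]
    if np.linalg.eigvalsh(theta_hat).min() <= 0.0:
        return False  # the primal solution must be positive definite
    # Solve Sigma_hat - Theta_hat^{-1} + lam * Z = 0 for Z.
    z_hat = (np.linalg.inv(theta_hat) - sigma_hat) / lam
    off = ~np.eye(d, dtype=bool)
    diag_ok = np.allclose(np.diag(z_hat), 0.0, atol=tol)   # diagonal is unpenalized
    bounded = np.all(np.abs(z_hat[off]) <= 1.0 + tol)      # max-norm at most one
    active = off & (np.abs(theta_hat) > tol)               # non-zero off-diagonal entries
    signs_ok = np.allclose(z_hat[active], np.sign(theta_hat[active]), atol=tol)
    return bool(diag_ok and bounded and signs_ok)

sigma_hat = np.array([[1.0, 0.3], [0.3, 2.0]])
print(satisfies_glasso_kkt(np.diag([1.0, 0.5]), sigma_hat, lam=0.5))
print(satisfies_glasso_kkt(np.linalg.inv(sigma_hat), sigma_hat, lam=0.5))
```

The first candidate is the diagonal matrix with entries $1/\widehat{\Sigma}_{jj}$, which is optimal here because the regularization level exceeds the off-diagonal sample covariance; the second is the unregularized inverse, which fails the sign condition.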


Proposition 11.10 also implies bounds on the operator norm error in the estimate $\widehat{\Theta}$.

Corollary 11.11 (Operator norm bounds) Under the conditions of Proposition 11.10,
consider the graphical Lasso estimate $\widehat{\Theta}$ with regularization parameter $\lambda_n = \frac{c_1}{\alpha} \sqrt{\frac{\log d}{n}} + \delta$
for some $\delta \in (0, 1]$. Then with probability at least $1 - c_2 e^{-c_3 n \delta^2}$, we have
\[
|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_2 \leq c_4\, |\!|\!|A|\!|\!|_2 \Big\{ (1 + 8\alpha^{-1}) \sqrt{\frac{\log d}{n}} + \lambda_n \Big\}, \tag{11.17a}
\]
where A denotes the adjacency matrix of the graph G (including ones on the diagonal).
In particular, if the graph has maximum degree m, then
\[
|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_2 \leq c_4 (m + 1) \Big\{ (1 + 8\alpha^{-1}) \sqrt{\frac{\log d}{n}} + \lambda_n \Big\}. \tag{11.17b}
\]


Proof These claims follow in a straightforward way from Proposition 11.10 and certain
properties of the operator norm exploited previously in Chapter 6. In particular,
Proposition 11.10 guarantees that for any pair $(j,k) \notin E$, we have $|\widehat{\Theta}_{jk} - \Theta^*_{jk}| = 0$, whereas the
bound (11.16) ensures that for any pair $(j,k) \in E$, we have $|\widehat{\Theta}_{jk} - \Theta^*_{jk}| \leq c_4\{\tau(n,d,\alpha) + \lambda_n\}$.
Note that the same bound holds whenever $j = k$. Putting together the pieces, we conclude
that
\[
|\widehat{\Theta}_{jk} - \Theta^*_{jk}| \leq c_4 \{\tau(n, d, \alpha) + \lambda_n\}\, A_{jk}, \tag{11.18}
\]
where A is the adjacency matrix, including ones on the diagonal. Using the matrix-theoretic
properties from Exercise 6.3(c), we conclude that
\[
|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_2 \leq |\!|\!|\, |\widehat{\Theta} - \Theta^*|\, |\!|\!|_2 \leq c_4 \{\tau(n, d, \alpha) + \lambda_n\}\, |\!|\!|A|\!|\!|_2,
\]
thus establishing the bound (11.17a). The second inequality (11.17b) follows by noting that
$|\!|\!|A|\!|\!|_2 \leq m + 1$ for any graph of degree at most m. (See the discussion following Corollary 6.24
for further details.)


As we noted in Chapter 6, the bound (11.17b) is not tight for a general graph with
maximum degree m. In particular, a star graph with one hub connected to m other nodes
(see Figure 6.1(b)) has maximum degree m, but satisfies $|\!|\!|A|\!|\!|_2 = 1 + \sqrt{m - 1}$, so that the
bound (11.17a) implies the operator norm bound $|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_2 \lesssim \sqrt{\frac{m \log d}{n}}$. This guarantee is
tighter by a factor of $\sqrt{m}$ than the conservative bound (11.17b).

It should also be noted that Proposition 11.10 also implies bounds on the Frobenius norm
error. In particular, the elementwise bound (11.18) implies that
\[
|\!|\!|\widehat{\Theta} - \Theta^*|\!|\!|_F \leq c_3 \sqrt{2s + d}\, \Big\{ (1 + 8\alpha^{-1}) \sqrt{\frac{\log d}{n}} + \lambda_n \Big\}, \tag{11.19}
\]
where s is the total number of edges in the graph. We leave the verification of this claim as
an exercise for the reader.


11.2.2 Neighborhood-based methods 


The Gaussian graphical Lasso is a global method, one that estimates the full graph
simultaneously. An alternative class of procedures, known as neighborhood-based methods, are
instead local. They are based on the observation that recovering the full graph is equivalent
to recovering the neighborhood set (11.5) of each vertex $j \in V$, and that these neighborhoods
are revealed via the Markov properties of the graph.


Neighborhood-based regression 


Recall our earlier Definition 11.5 of the Markov properties associated with a graph. In our
discussion following this definition, we also noted that for any given vertex $j \in V$, the
neighborhood $\mathcal{N}(j)$ is a vertex cutset that breaks the graph into the disjoint pieces $\{j\}$ and $V \setminus \mathcal{N}^+(j)$,
where we have introduced the convenient shorthand $\mathcal{N}^+(j) := \{j\} \cup \mathcal{N}(j)$. Consequently, by
applying the definition (11.5), we conclude that
\[
X_j \perp\!\!\!\perp X_{V \setminus \mathcal{N}^+(j)} \mid X_{\mathcal{N}(j)}. \tag{11.20}
\]
Thus, the neighborhood structure of each node is encoded in the structure of the conditional
distribution. What is a good way to detect these conditional independence relationships and
hence the neighborhood? A particularly simple method is based on the idea of neighborhood
regression: for a given vertex $j \in V$, we use the random variables $X_{\setminus\{j\}} := \{X_k \mid k \in V \setminus \{j\}\}$
to predict $X_j$, and keep only those variables that turn out to be useful.

Let us now formalize this idea in the Gaussian case. In this case, by standard properties
of multivariate Gaussian distributions, the conditional distribution of $X_j$ given $X_{\setminus\{j\}}$ is also
Gaussian. Therefore, the random variable $X_j$ has a decomposition as the sum of the best
linear prediction based on $X_{\setminus\{j\}}$ plus an error term, namely
\[
X_j = \langle X_{\setminus\{j\}}, \theta^*_j \rangle + W_j, \tag{11.21}
\]
where $\theta^*_j \in \mathbb{R}^{d-1}$ is a vector of regression coefficients, and $W_j$ is a zero-mean Gaussian
variable, independent of $X_{\setminus\{j\}}$. (See Exercise 11.3 for the derivation of these and related
properties.) Moreover, the conditional independence relation (11.20) ensures that $\theta^*_{jk} = 0$ for
all $k \notin \mathcal{N}(j)$. In this way, we have reduced the problem of Gaussian graph selection to that
of detecting the support in a sparse linear regression problem. As discussed in Chapter 7, the
Lasso provides a computationally efficient approach to such support recovery tasks.

In summary, the neighborhood-based approach to Gaussian graphical selection proceeds
as follows. Given n samples $\{x_1, \ldots, x_n\}$, we use $X \in \mathbb{R}^{n \times d}$ to denote the design matrix with
$x_i \in \mathbb{R}^d$ as its ith row, and then perform the following steps.

Lasso-based neighborhood regression:

1 For each node $j \in V$:

(a) Extract the column vector $X_j \in \mathbb{R}^n$ and the submatrix $X_{\setminus\{j\}} \in \mathbb{R}^{n \times (d-1)}$.
(b) Solve the Lasso problem:
\[
\widehat{\theta} = \arg\min_{\theta \in \mathbb{R}^{d-1}} \Big\{ \frac{1}{2n} \|X_j - X_{\setminus\{j\}} \theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}. \tag{11.22}
\]
(c) Return the neighborhood estimate $\widehat{\mathcal{N}}(j) = \{k \in V \setminus \{j\} \mid \widehat{\theta}_k \neq 0\}$.

2 Combine the neighborhood estimates to form an edge estimate $\widehat{E}$, using either the
OR rule or the AND rule.


Note that the first step returns a neighborhood estimate $\widehat{\mathcal{N}}(j)$ for each vertex $j \in V$.
These neighborhood estimates may be inconsistent, meaning that for a given pair of distinct
vertices $(j,k)$, it may be the case that $k \in \widehat{\mathcal{N}}(j)$ whereas $j \notin \widehat{\mathcal{N}}(k)$. Some rules to resolve this
issue include:

• the OR rule that declares that $(j, k) \in \widehat{E}_{\mathrm{OR}}$ if either $k \in \widehat{\mathcal{N}}(j)$ or $j \in \widehat{\mathcal{N}}(k)$;
• the AND rule that declares that $(j,k) \in \widehat{E}_{\mathrm{AND}}$ if $k \in \widehat{\mathcal{N}}(j)$ and $j \in \widehat{\mathcal{N}}(k)$.

By construction, the AND rule is more conservative than the OR rule, meaning that $\widehat{E}_{\mathrm{AND}} \subseteq \widehat{E}_{\mathrm{OR}}$.
The theoretical guarantees that we provide end up holding for either rule, since we
control the behavior of each neighborhood regression problem.
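The procedure above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy; it solves each Lasso problem (11.22) by cyclic coordinate descent, which is one standard solver and not necessarily the one intended by the text. The function names, the number of sweeps and the tolerance for declaring a coefficient non-zero are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for k in range(p):
            if col_sq[k] == 0.0:
                continue
            resid += X[:, k] * theta[k]            # remove coordinate k from the fit
            rho = X[:, k] @ resid / n
            theta[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            resid -= X[:, k] * theta[k]            # add the updated coordinate back
    return theta

def neighborhood_selection(X, lam, tol=1e-8):
    """Return (E_and, E_or) from Lasso-based neighborhood regression."""
    n, d = X.shape
    nbrs = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        theta = lasso_cd(X[:, others], X[:, j], lam)
        nbrs.append({k for k, t in zip(others, theta) if abs(t) > tol})
    pairs = [(j, k) for j in range(d) for k in range(j + 1, d)]
    e_or = {(j, k) for j, k in pairs if k in nbrs[j] or j in nbrs[k]}
    e_and = {(j, k) for j, k in pairs if k in nbrs[j] and j in nbrs[k]}
    return e_and, e_or

# Toy design: columns 0 and 1 coincide, column 2 is orthogonal to both.
X = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, -1.0, -1.0]])
e_and, e_or = neighborhood_selection(X, lam=0.1)
print(e_and, e_or)
```

In this toy design only the pair of identical columns is linked, so both rules return the single edge between vertices 0 and 1. On real data the two rules can differ, with the AND estimate always contained in the OR estimate.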


Graph selection consistency 


We now state a result that guarantees selection consistency of neighborhood regression.
As with our previous analysis of the Lasso in Chapter 7 and the graphical Lasso in
Section 11.2.1, we require an incoherence condition. Given a positive definite matrix $\Gamma$ and a
subset S of its columns, we say $\Gamma$ is $\alpha$-incoherent with respect to S if
\[
\max_{k \notin S} \big\|\Gamma_{kS} (\Gamma_{SS})^{-1}\big\|_1 \leq 1 - \alpha. \tag{11.23}
\]
Here the scalar $\alpha \in (0, 1]$ is the incoherence parameter. As discussed in Chapter 7, if we view
$\Gamma$ as the covariance matrix of a random vector $Z \in \mathbb{R}^d$, then the row vector $\Gamma_{kS}(\Gamma_{SS})^{-1}$
specifies the coefficients of the optimal linear predictor of $Z_k$ given the variables $Z_S := \{Z_j, j \in S\}$.
Thus, the incoherence condition (11.23) imposes a limit on the degree of dependence
between the variables in the correct subset S and any variable outside of S.
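For a given matrix and subset, the largest admissible incoherence parameter in (11.23) can be computed directly. The sketch below assumes NumPy and that S is a proper, non-empty subset of the indices; the function name `incoherence_parameter` and the autoregressive example matrix are illustrative assumptions.

```python
import numpy as np

def incoherence_parameter(gamma, support):
    """Largest alpha with max_{k not in S} ||Gamma_kS (Gamma_SS)^{-1}||_1 <= 1 - alpha.

    A non-positive return value means the matrix is not incoherent with
    respect to the given subset. Assumes S is a proper, non-empty subset.
    """
    d = gamma.shape[0]
    s = list(support)
    rest = [k for k in range(d) if k not in set(s)]
    # Each row holds the coefficients of the best linear predictor of Z_k from Z_S.
    coeffs = gamma[np.ix_(rest, s)] @ np.linalg.inv(gamma[np.ix_(s, s)])
    return 1.0 - np.abs(coeffs).sum(axis=1).max()

# Hypothetical example: Gamma_jk = 0.5^{|j-k|} on three variables, S = {0}.
gamma = np.array([[0.5 ** abs(j - k) for k in range(3)] for j in range(3)])
alpha = incoherence_parameter(gamma, [0])
print(alpha)
```

For this example the rows outside S have predictor coefficients 0.5 and 0.25, so the computed parameter is governed by the variable most strongly correlated with the subset.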

The following result guarantees graph selection consistency of the Lasso-based
neighborhood procedure, using either the AND or the OR rules, for a Gauss–Markov random field in
which the covariance matrix $\Sigma^* = (\Theta^*)^{-1}$ has maximum degree m, and diagonals scaled such
that $\operatorname{diag}(\Sigma^*) \leq 1$. This latter inequality entails no loss of generality, since it can always be
guaranteed by rescaling the variables. Our statement involves the $\ell_\infty$-matrix-operator norm
$|\!|\!|A|\!|\!|_\infty := \max_{i} \sum_{j} |A_{ij}|$, the maximum absolute row sum.

Finally, in stating the result, we assume that the sample size is lower bounded as $n \gtrsim m \log d$.
This assumption entails no loss of generality, because a sample size of this order is
actually necessary for any method. See the bibliographic section for further details on such
information-theoretic lower bounds for graphical model selection.


Theorem 11.12 (Graph selection consistency) Consider a zero-mean Gaussian
random vector with covariance $\Sigma^*$ such that for each $j \in V$, the submatrix $\Sigma^*_{\setminus\{j\}} := \operatorname{cov}(X_{\setminus\{j\}})$
is $\alpha$-incoherent with respect to $\mathcal{N}(j)$, and $|\!|\!|(\Sigma^*_{\mathcal{N}(j), \mathcal{N}(j)})^{-1}|\!|\!|_\infty \leq b$ for some $b \geq 1$. Suppose
that the neighborhood Lasso selection method is implemented with $\lambda_n = c_0 \big\{ \frac{1}{\alpha} \sqrt{\frac{\log d}{n}} + \delta \big\}$
for some $\delta \in (0, 1]$. Then with probability greater than $1 - c_2 e^{-c_3 n \min\{\delta^2,\, \frac{1}{m}\}}$, the estimated
edge set $\widehat{E}$, based on either the AND or OR rules, has the following properties:

(a) No false inclusions: it includes no false edges, so that $\widehat{E} \subseteq E$.
(b) All significant edges are captured: it includes all edges $(j, k)$ for which $|\Theta^*_{jk}| \geq 7 b \lambda_n$.

Of course, if the non-zero entries of the precision matrix are bounded below in absolute
value as $\min_{(j,k) \in E} |\Theta^*_{jk}| > 7 b \lambda_n$, then in fact Theorem 11.12 guarantees that $\widehat{E} = E$ with high
probability.


Proof It suffices to show that for each $j \in V$, the neighborhood $\mathcal{N}(j)$ is recovered with high
probability; we can then apply the union bound over all the vertices. The proof requires an
extension of the primal–dual witness technique used to prove Theorem 7.21. The main
difference is that Theorem 11.12 applies to random covariates, as opposed to the case of
deterministic design covered by Theorem 7.21. In order to reduce notational overhead, we adopt the
shorthand $\Gamma = \operatorname{cov}(X_{\setminus\{j\}})$ along with the two subsets $S = \mathcal{N}(j)$ and $S^c = V \setminus \mathcal{N}^+(j)$. In this
notation, we can write our observation model as $X_j = X_{\setminus\{j\}} \theta^* + W_j$, where $X_{\setminus\{j\}} \in \mathbb{R}^{n \times (d-1)}$
while $X_j$ and $W_j$ are both n-vectors. In addition, we let $\widehat{\Gamma} = \frac{1}{n} X_{\setminus\{j\}}^T X_{\setminus\{j\}}$ denote the sample
covariance defined by the design matrix, and we use $\widehat{\Gamma}_{SS}$ to denote the submatrix indexed
by the subset S, with the submatrix $\widehat{\Gamma}_{S^c S}$ defined similarly.


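Proof of part (a): We follow the proof of Theorem 7.21 until equation (7.53), namely
\[
\widehat{z}_{S^c} = \underbrace{\widehat{\Gamma}_{S^c S} (\widehat{\Gamma}_{SS})^{-1} \widehat{z}_S}_{\mu \,\in\, \mathbb{R}^{d-s}}
+ \underbrace{X_{S^c}^T \big[ I_n - X_S (X_S^T X_S)^{-1} X_S^T \big] \Big( \frac{W_j}{\lambda_n n} \Big)}_{V_{S^c} \,\in\, \mathbb{R}^{d-s}}. \tag{11.24}
\]
As argued in Chapter 7, in order to establish that the Lasso support is included within S, it
suffices to establish the strict dual feasibility condition $\|\widehat{z}_{S^c}\|_\infty < 1$. We do so by establishing
that
\[
\mathbb{P}\Big[ \|\mu\|_\infty \geq 1 - \frac{3}{4}\alpha \Big] \leq c_1 e^{-c_2 n \alpha^2 - \log d} \tag{11.25a}
\]
and
\[
\mathbb{P}\Big[ \|V_{S^c}\|_\infty \geq \frac{\alpha}{4} \Big] \leq c_1 e^{-c_2 n \delta^2 \alpha^2 - \log d}. \tag{11.25b}
\]
Taken together, these bounds ensure that $\|\widehat{z}_{S^c}\|_\infty \leq 1 - \frac{\alpha}{2} < 1$, and hence that the Lasso
support is contained within $S = \mathcal{N}(j)$, with probability at least $1 - c_1 e^{-c_2 n \delta^2 \alpha^2 - \log d}$, where the
values of the universal constants may change from line to line. Taking the union bound over
all d vertices, we conclude that $\widehat{E} \subseteq E$ with probability at least $1 - c_1 e^{-c_2 n \delta^2 \alpha^2}$.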


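Let us begin by establishing the bound (11.25a). By standard properties of multivariate
Gaussian vectors, we can write
\[
X_{S^c}^T = \Gamma_{S^c S} (\Gamma_{SS})^{-1} X_S^T + \widetilde{W}_{S^c}^T, \tag{11.26}
\]
where $\widetilde{W}_{S^c} \in \mathbb{R}^{n \times |S^c|}$ is a zero-mean Gaussian random matrix that is independent of $X_S$.
Observe moreover that
\[
\operatorname{cov}(\widetilde{W}_{S^c}) = \Gamma_{S^c S^c} - \Gamma_{S^c S} (\Gamma_{SS})^{-1} \Gamma_{S S^c} \preceq \Gamma_{S^c S^c}.
\]
Recalling our assumption that $\operatorname{diag}(\Sigma^*) \leq 1$, we see that the elements of $\widetilde{W}_{S^c}$ have variance
at most 1.
Using the decomposition (11.26) and the triangle inequality, we have
\[
\begin{aligned}
\|\mu\|_\infty &= \Big\| \Gamma_{S^c S} (\Gamma_{SS})^{-1} \widehat{z}_S + \frac{\widetilde{W}_{S^c}^T}{\sqrt{n}} \frac{X_S}{\sqrt{n}} (\widehat{\Gamma}_{SS})^{-1} \widehat{z}_S \Big\|_\infty \\
&\overset{(i)}{\leq} (1 - \alpha) + \Big\| \underbrace{\frac{\widetilde{W}_{S^c}^T}{\sqrt{n}} \frac{X_S}{\sqrt{n}} (\widehat{\Gamma}_{SS})^{-1} \widehat{z}_S}_{\widetilde{V} \,\in\, \mathbb{R}^{|S^c|}} \Big\|_\infty,
\end{aligned}
\tag{11.27}
\]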


where step (i) uses the population-level a-incoherence condition. Turning to the remaining 
stochastic term, conditioned on the design matrix, the vector V is a zero-mean Gaussian 
random vector, each entry of which has standard deviation at most 


SIRE Psy Zle < aE ‘ls lle 
ce ssl, vin 

bm 

= he 


where inequality (i) follows with probability at least 1 — 4e~'”, using standard bounds on 
Gaussian random matrices (see Theorem 6.1). Using this upper bound to control the condi- 
tional variance of V, standard Gaussian tail bounds and the union bound then ensure that 


y nt a 
P[I > t] < 28e t < 2 etn toed, 


eee logd 2] 1/2 bm logd 


We now set t = Ta La , a quantity which is less than 7 as long asn>c for 
a sufficiently large vee constant. Thus, we have ey that Vloo < < ¢ with TO 
bility at least 1 — cye2"” -logd Combined with the earlier bound (11.27), the claim (11.25a) 
follows. 

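Turning to the bound (11.25b), note that the matrix $\Pi := I_n - X_S (X_S^T X_S)^{-1} X_S^T$ has the
range of $X_S$ as its nullspace. Thus, using the decomposition (11.26), we have
\[
V_{S^c} = \widetilde{W}_{S^c}^T\, \Pi \Big( \frac{W_j}{\lambda_n n} \Big),
\]
where $\widetilde{W}_{S^c} \in \mathbb{R}^{n \times |S^c|}$ is independent of $\Pi$ and $W_j$. Since $\Pi$ is a projection matrix, we have
$\|\Pi W_j\|_2 \leq \|W_j\|_2$. The vector $W_j \in \mathbb{R}^n$ has i.i.d. Gaussian entries with variance at most 1,
and hence the event $\mathcal{E} = \big\{ \frac{\|W_j\|_2}{\sqrt{n}} \leq 2 \big\}$ holds with probability at least $1 - 2e^{-n}$. Conditioning on
this event and its complement, we find that
\[
\mathbb{P}\big[ \|V_{S^c}\|_\infty \geq t \big] \leq \mathbb{P}\big[ \|V_{S^c}\|_\infty \geq t \mid \mathcal{E} \big] + 2e^{-n}.
\]
Conditioned on $\mathcal{E}$, each element of $V_{S^c}$ has variance at most $\frac{4}{\lambda_n^2 n}$, and hence
\[
\mathbb{P}\Big[ \|V_{S^c}\|_\infty \geq \frac{\alpha}{4} \Big] \leq 2 e^{-\frac{\lambda_n^2 n \alpha^2}{256} + \log |S^c|} + 2e^{-n},
\]
where we have combined the union bound with standard Gaussian tail bounds. Since
$\lambda_n = c_0 \big\{ \frac{1}{\alpha} \sqrt{\frac{\log d}{n}} + \delta \big\}$ for a universal constant $c_0$ that may be chosen, we can ensure that
$\frac{\lambda_n^2 n \alpha^2}{256} \geq c_2 n \alpha^2 \delta^2 + 2 \log d$ for some constant $c_2$, from which it follows that
\[
\mathbb{P}\Big[ \|V_{S^c}\|_\infty \geq \frac{\alpha}{4} \Big] \leq c_1 e^{-c_2 n \delta^2 \alpha^2 - \log d} + 2e^{-n}.
\]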


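Proof of part (b): In order to prove part (b) of the theorem, it suffices to establish $\ell_\infty$-bounds
on the error in the Lasso solution. Here we provide a proof in the case $m \leq \log d$, referring
the reader to the bibliographic section for discussion of the general case. Again returning to
the proof of Theorem 7.21, equation (7.54) guarantees that
\[
\begin{aligned}
\|\widehat{\theta}_S - \theta^*_S\|_\infty
&\leq \Big\| (\widehat{\Gamma}_{SS})^{-1} X_S^T \frac{W_j}{n} \Big\|_\infty + \lambda_n |\!|\!|(\widehat{\Gamma}_{SS})^{-1}|\!|\!|_\infty \\
&\leq \Big\| (\widehat{\Gamma}_{SS})^{-1} X_S^T \frac{W_j}{n} \Big\|_\infty + \lambda_n \Big\{ |\!|\!|(\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}|\!|\!|_\infty + |\!|\!|(\Gamma_{SS})^{-1}|\!|\!|_\infty \Big\}.
\end{aligned}
\tag{11.28}
\]
Now for any symmetric $m \times m$ matrix, we have
\[
|\!|\!|A|\!|\!|_\infty = \max_{i=1,\ldots,m} \sum_{\ell=1}^m |A_{i\ell}| \leq \sqrt{m} \max_{i=1,\ldots,m} \sqrt{\sum_{\ell=1}^m |A_{i\ell}|^2} \leq \sqrt{m}\, |\!|\!|A|\!|\!|_2.
\]
Applying this bound to the matrix $A = (\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}$, we find that
\[
|\!|\!|(\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}|\!|\!|_\infty \leq \sqrt{m}\, |\!|\!|(\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}|\!|\!|_2. \tag{11.29}
\]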


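Since $|\!|\!|(\Gamma_{SS})^{-1}|\!|\!|_2 \leq |\!|\!|(\Gamma_{SS})^{-1}|\!|\!|_\infty \leq b$, applying the random matrix bound from Theorem 6.1 allows
us to conclude that
\[
|\!|\!|(\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}|\!|\!|_2 \leq 2b \Big( \sqrt{\frac{m}{n}} + \frac{1}{\sqrt{m}} + 10 \sqrt{\frac{\log d}{n}} \Big),
\]
with probability at least $1 - c_1 e^{-c_2 \frac{n}{m} - \log d}$. Combined with the earlier bound (11.29), we find
that
\[
|\!|\!|(\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}|\!|\!|_\infty \leq 2b \Big( \sqrt{\frac{m^2}{n}} + 1 + 10 \sqrt{\frac{m \log d}{n}} \Big) \overset{(i)}{\leq} 6b, \tag{11.30}
\]
where inequality (i) uses the assumed lower bound $n \gtrsim m \log d \geq m^2$. Putting together the
pieces in the bound (11.28) leads to
\[
\|\widehat{\theta}_S - \theta^*_S\|_\infty \leq \Big\| \underbrace{(\widehat{\Gamma}_{SS})^{-1} X_S^T \frac{W_j}{n}}_{U_S} \Big\|_\infty + 7 b \lambda_n. \tag{11.31}
\]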

Now the vector $W_j \in \mathbb{R}^n$ has i.i.d. Gaussian entries, each zero-mean with variance at most
$\operatorname{var}(X_j) \leq 1$, and is independent of $X_S$. Consequently, conditioned on $X_S$, the quantity $U_S$ is
a zero-mean Gaussian m-vector, with maximal variance
\[
\frac{1}{n} \big\| \operatorname{diag}\big((\widehat{\Gamma}_{SS})^{-1}\big) \big\|_\infty
\leq \frac{1}{n} \Big\{ |\!|\!|(\widehat{\Gamma}_{SS})^{-1} - (\Gamma_{SS})^{-1}|\!|\!|_\infty + |\!|\!|(\Gamma_{SS})^{-1}|\!|\!|_\infty \Big\} \leq \frac{7b}{n},
\]
where we have combined the assumed bound $|\!|\!|(\Gamma_{SS})^{-1}|\!|\!|_\infty \leq b$ with the inequality (11.30).
Therefore, the union bound combined with Gaussian tail bounds implies that
\[
\mathbb{P}\big[ \|U_S\|_\infty \geq b \lambda_n \big] \leq 2 |S|\, e^{-\frac{n b \lambda_n^2}{14}} \overset{(i)}{\leq} c_1 e^{-c_2 n b \delta^2 - \log d},
\]
where, as in our earlier argument, inequality (i) can be guaranteed by a sufficiently large
choice of the pre-factor $c_0$ in the definition of $\lambda_n$. Substituting back into the earlier bound
(11.31), we find that $\|\widehat{\theta}_S - \theta^*_S\|_\infty \leq 7 b \lambda_n$, with probability at least $1 - c_1 e^{-c_2 n \{\delta^2 \wedge \frac{1}{m}\} - \log d}$. Finally,
taking the union bound over all vertices $j \in V$ causes a loss of at most a factor $\log d$ in the
exponent.


11.3 Graphical models in exponential form 


Let us now move beyond the Gaussian case, and consider the graph estimation problem
for a more general class of graphical models that can be written in an exponential form. In
particular, for a given graph $G = (V, E)$, consider probability densities that have a pairwise
factorization of the form
\[
p_{\Theta^*}(x_1, \ldots, x_d) \propto \exp\Big\{ \sum_{j \in V} \phi_j(x_j; \Theta^*_j) + \sum_{(j,k) \in E} \phi_{jk}(x_j, x_k; \Theta^*_{jk}) \Big\}, \tag{11.32}
\]
where $\Theta^*_j$ is a vector of parameters for node $j \in V$, and $\Theta^*_{jk}$ is a matrix of parameters for
edge $(j,k)$. For instance, the Gaussian graphical model is a special case in which $\Theta^*_j = \theta^*_j$
and $\Theta^*_{jk} = \theta^*_{jk}$ are both scalars, the potential functions take the form
\[
\phi_j(x_j; \theta^*_j) = \theta^*_j x_j, \qquad \phi_{jk}(x_j, x_k; \theta^*_{jk}) = \theta^*_{jk} x_j x_k, \tag{11.33}
\]
and the density (11.32) is taken with respect to Lebesgue measure over $\mathbb{R}^d$. The Ising
model (11.3) is another special case, using the same choice of potential functions (11.33),
but taking the density with respect to the counting measure on the binary hypercube $\{0, 1\}^d$.


Let us consider a few more examples of this factorization: 


Example 11.13 (Potts model) The Potts model, in which each variable $X_s$ takes values
in the discrete set $\{0, \ldots, M - 1\}$, is another special case of the factorization (11.32). In this
case, the parameter $\Theta^*_j = \{\Theta^*_{j;a}, a = 1, \ldots, M-1\}$ is an $(M - 1)$-vector, whereas the parameter
$\Theta^*_{jk} = \{\Theta^*_{jk;ab}, a, b = 1, \ldots, M - 1\}$ is an $(M - 1) \times (M - 1)$ matrix. The potential functions
take the form
\[
\phi_j(x_j; \Theta^*_j) = \sum_{a=1}^{M-1} \Theta^*_{j;a}\, \mathbb{I}[x_j = a] \tag{11.34a}
\]
and
\[
\phi_{jk}(x_j, x_k; \Theta^*_{jk}) = \sum_{a=1}^{M-1} \sum_{b=1}^{M-1} \Theta^*_{jk;ab}\, \mathbb{I}[x_j = a, x_k = b]. \tag{11.34b}
\]
Here $\mathbb{I}[x_j = a]$ is a zero–one indicator function for the event that $\{x_j = a\}$, with the indicator
function $\mathbb{I}[x_j = a, x_k = b]$ defined analogously. Note that the Potts model is a generalization
of the Ising model (11.3), to which it reduces for variables taking $M = 2$ states. ♣
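The Potts potentials (11.34a) and (11.34b) amount to table lookups. The following minimal sketch uses plain Python lists; the function names and the storage convention (index a - 1 holds the weight for state a, with state 0 as the zero-weight reference) are assumptions made for illustration.

```python
def potts_node_potential(x_j, theta_j):
    """Node potential (11.34a): theta_j[a - 1] is the weight of state a >= 1."""
    return theta_j[x_j - 1] if x_j >= 1 else 0.0

def potts_edge_potential(x_j, x_k, theta_jk):
    """Edge potential (11.34b): theta_jk[a - 1][b - 1] is the weight of (a, b)."""
    if x_j >= 1 and x_k >= 1:
        return theta_jk[x_j - 1][x_k - 1]
    return 0.0

# With M = 2 states the potentials reduce to the Ising form theta * x_j and theta * x_j * x_k.
theta_node, theta_edge = [0.7], [[-0.3]]
for x_j in (0, 1):
    for x_k in (0, 1):
        print(x_j, x_k,
              potts_node_potential(x_j, theta_node),
              potts_edge_potential(x_j, x_k, theta_edge))
```

Because only one indicator in each sum can be active, the sums in (11.34a) and (11.34b) never need to be evaluated term by term.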


Example 11.14 (Poisson graphical model) Suppose that we are interested in modeling a
collection of random variables $(X_1, \ldots, X_d)$, each of which represents some type of count
data taking values in the set of non-negative integers $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$. One way of building
a graphical model for such variables is by specifying the conditional distribution of each
variable given its neighbors. In particular, suppose that variable $X_j$, when conditioned on its
neighbors, is a Poisson random variable with mean

$$\mu_j = \exp\Big(\theta_j + \sum_{k \in N(j)} \theta_{jk} x_k\Big).$$

This form of conditional distribution leads to a Markov random field of the form (11.32)
with

$$\phi_j(x_j; \theta_j) = \theta_j x_j - \log(x_j!) \quad \text{for all } j \in V, \tag{11.35a}$$

$$\phi_{jk}(x_j, x_k; \theta_{jk}) = \theta_{jk} x_j x_k \quad \text{for all } (j,k) \in E. \tag{11.35b}$$

Here the density is taken with respect to the counting measure on $\mathbb{Z}_+$ for all variables. A
potential deficiency of this model is that, in order for the density to be normalizable, we
must necessarily have $\theta_{jk} \le 0$ for all $(j,k) \in E$. Consequently, this model can only capture
competitive interactions between variables. ♣
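
The normalizability constraint can be checked numerically in the smallest case. The following sketch considers two nodes joined by one edge and evaluates the unnormalized log-density, built from the potentials (11.35a) and (11.35b), along the diagonal $x_1 = x_2 = x$. The function name `log_weight_on_diagonal` and the specific parameter values are illustrative choices, not part of the text.

```python
import math

def log_weight_on_diagonal(x, theta_12, theta_1=0.0, theta_2=0.0):
    """Unnormalized log-density of a two-node Poisson graphical model at (x, x).

    Sums the node potentials theta_j * x - log(x!) for j = 1, 2 and the
    edge potential theta_12 * x * x.
    """
    log_factorial = math.lgamma(x + 1)  # log(x!) computed without overflow
    return (theta_1 + theta_2) * x - 2.0 * log_factorial + theta_12 * x * x

# Compare a positive and a negative edge weight as x grows.
for theta_12 in (0.1, -0.1):
    values = [log_weight_on_diagonal(x, theta_12) for x in (10, 100, 1000)]
    print(theta_12, [round(v, 1) for v in values])
```

For a positive edge weight the quadratic term $\theta_{12} x^2$ eventually dominates $2\log(x!)$, which grows only like $2x\log x$, so the weights increase without bound and cannot be summed. For a negative edge weight they decay rapidly.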


One can also consider various types of mixed graphical models, for instance in which 
some of the nodes take discrete values, whereas others are continuous-valued. Gaussian 
mixture models are one important class of such models. 


11.3.1 A general form of neighborhood regression 


We now consider a general form of neighborhood regression, applicable to any graphical
model of the form (11.32). Let $\{x_i\}_{i=1}^n$ be a collection of $n$ samples drawn i.i.d. from such a
graphical model; here each $x_i$ is a $d$-vector. Based on these samples, we can form a matrix
$X \in \mathbb{R}^{n \times d}$ with $x_i^{\mathrm{T}}$ as the $i$th row. For $j = 1, \ldots, d$, we let $X_j \in \mathbb{R}^n$ denote the $j$th column of
$X$. Neighborhood regression is based on predicting the column $X_j \in \mathbb{R}^n$ using the columns
of the submatrix $X_{\setminus\{j\}} \in \mathbb{R}^{n \times (d-1)}$.

Consider the conditional likelihood of $X_j \in \mathbb{R}^n$ given $X_{\setminus\{j\}} \in \mathbb{R}^{n \times (d-1)}$. As we show in
Exercise 11.6, for any distribution of the form (11.32), this conditional likelihood depends
only on the vector of parameters

$$\Theta_{j+} := \{\Theta_j,\ \Theta_{jk},\ k \in V \setminus \{j\}\} \tag{11.36}$$

that involve node $j$. Moreover, in the true model $\Theta^*$, we are guaranteed that $\Theta^*_{jk} = 0$ whenever
$(j,k) \notin E$, so that it is natural to impose some type of block-based sparsity penalty on
$\Theta_{j+}$. Letting $|||\cdot|||$ denote some matrix norm, we arrive at a general form of neighborhood
regression:

$$\hat{\Theta}_{j+} = \arg\min_{\Theta_{j+}} \Big\{ \underbrace{-\frac{1}{n} \sum_{i=1}^n \log p_{\Theta_{j+}}\big(x_{ij} \mid x_{i \setminus\{j\}}\big)}_{\mathcal{L}_n(\Theta_{j+};\, x_j,\, x_{\setminus\{j\}})} + \lambda_n \sum_{k \in V \setminus \{j\}} |||\Theta_{jk}||| \Big\}. \tag{11.37}$$

This formulation actually describes a family of estimators, depending on which norm $|||\cdot|||$
we impose on each matrix component $\Theta_{jk}$. Perhaps the simplest is the Frobenius norm,
in which case the estimator (11.37) is a general form of the group Lasso; for details, see
equation (9.66) and the associated discussion in Chapter 9. Also, as we verify in Exercise 11.5,
this formula reduces to $\ell_1$-regularized linear regression (11.22) in the Gaussian case.


11.3.2 Graph selection for Ising models 


In this section, we consider the graph selection problem for a particular type of non-Gaussian
distribution, namely the Ising model. Recall that the Ising distribution is over binary
variables, and takes the form

$$p_{\theta^*}(x_1, \ldots, x_d) \propto \exp\Big\{ \sum_{j \in V} \theta^*_j x_j + \sum_{(j,k) \in E} \theta^*_{jk} x_j x_k \Big\}. \tag{11.38}$$

Since there is only a single parameter per edge, imposing an $\ell_1$-penalty suffices to encourage
sparsity in the neighborhood regression. For any given node $j \in V$, we define the subset of
coefficients associated with it, namely the set

$$\theta_{j+} := \{\theta_j,\ \theta_{jk},\ k \in V \setminus \{j\}\}.$$

For the Ising model, the neighborhood regression estimate reduces to a form of logistic
regression, specifically

$$\hat{\theta}_{j+} = \arg\min_{\theta_{j+} \in \mathbb{R}^d} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n f\Big(\theta_j x_{ij} + \sum_{k \in V \setminus \{j\}} \theta_{jk} x_{ij} x_{ik}\Big)}_{\mathcal{L}_n(\theta_{j+};\, x_j,\, x_{\setminus\{j\}})} + \lambda_n \sum_{k \in V \setminus \{j\}} |\theta_{jk}| \Big\}, \tag{11.39}$$

where $f(t) = \log(1 + e^t)$ is the logistic function. See Exercise 11.7 for details.
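
To make the procedure concrete, here is a minimal sketch of $\ell_1$-regularized logistic neighborhood regression, solved by proximal gradient descent. Several choices are assumptions of this sketch rather than prescriptions of the text: the loss is written in the standard negative log-likelihood form for $\{0,1\}$-coded variables, $\log(1 + e^{\eta}) - x_j \eta$ with $\eta = \theta_j + \sum_k \theta_{jk} x_k$ (the exact expression depends on the coding convention; see Exercise 11.7), the solver and step size are one simple option, and the names `ising_neighborhood` and `sample_toy_data` are hypothetical.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic sigmoid 1 / (1 + e^{-t})."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def soft_threshold(v, tau):
    """Proximal operator of tau * |.|: shrink v toward zero by tau."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def ising_neighborhood(X, j, lam, iters=300):
    """l1-regularized logistic regression of column j on the other columns.

    X is a list of rows with entries in {0, 1}. Returns (theta_j, weights),
    where weights[k] estimates theta_jk. Only the edge weights are penalized.
    """
    n, d = len(X), len(X[0])
    others = [k for k in range(d) if k != j]
    step = 4.0 / d  # 1/L, using the bound L <= d/4 on the gradient's Lipschitz constant
    bias, w = 0.0, {k: 0.0 for k in others}
    for _ in range(iters):
        g_bias, g_w = 0.0, {k: 0.0 for k in others}
        for row in X:
            eta = bias + sum(w[k] * row[k] for k in others)
            resid = sigmoid(eta) - row[j]  # derivative of log(1 + e^eta) - x_j * eta
            g_bias += resid / n
            for k in others:
                g_w[k] += resid * row[k] / n
        bias -= step * g_bias
        for k in others:
            w[k] = soft_threshold(w[k] - step * g_w[k], step * lam)
    return bias, w

def sample_toy_data(n, seed=0):
    """Toy data: X0 copies X1 with probability 0.9; X2 is independent of both."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.randint(0, 1), rng.randint(0, 1)
        x0 = x1 if rng.random() < 0.9 else 1 - x1
        rows.append([x0, x1, x2])
    return rows

theta_0, weights = ising_neighborhood(sample_toy_data(2000), j=0, lam=0.05)
# Estimated neighborhood of node 0: indices with a non-zero edge weight.
print(sorted(k for k, v in weights.items() if v != 0.0))
```

In this toy model the only edge at node 0 joins it to node 1, so a successful run assigns a clearly positive weight to index 1 and an exactly zero weight to index 2.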

Under what conditions does the estimate (11.39) recover the correct neighborhood set
$N(j)$? As in our earlier analysis of neighborhood linear regression and the graphical Lasso,
such a guarantee requires some form of incoherence condition, limiting the influence of
irrelevant variables (those outside $N(j)$) on variables inside the set. Recalling the cost
function $\mathcal{L}_n$ in the optimization problem (11.39), let $\theta^*_{j+}$ denote the minimizer of the
population objective function $\bar{\mathcal{L}}(\theta_{j+}) = \mathbb{E}[\mathcal{L}_n(\theta_{j+}; X_j, X_{\setminus\{j\}})]$. We then consider the Hessian of
the cost function $\bar{\mathcal{L}}$ evaluated at the "true parameter" $\theta^*_{j+}$, namely the $d$-dimensional matrix
$J := \nabla^2 \bar{\mathcal{L}}(\theta^*_{j+})$. For a given $\alpha \in (0, 1]$, we say that $J$ satisfies an $\alpha$-incoherence condition at
node $j \in V$ if

$$\max_{k \notin S} \big\| J_{kS} (J_{SS})^{-1} \big\|_1 \le 1 - \alpha, \tag{11.40}$$

where we have introduced the shorthand $S = N(j)$ for the neighborhood set of node $j$. In
addition, we assume the submatrix $J_{SS}$ has its smallest eigenvalue lower bounded by some
$c_{\min} > 0$. With this set-up, the following result applies to an Ising model (11.38) defined on a
graph $G$ with $d$ vertices and maximum degree at most $m$, with Fisher information $J$ at node
$j$ satisfying the $c_{\min}$-eigenvalue bound, and the $\alpha$-incoherence condition (11.40).


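Theorem 11.15 Given $n$ i.i.d. samples with $n > c_0 m^2 \log d$, consider the estimator
(11.39) with $\lambda_n = \frac{32}{\alpha} \sqrt{\frac{\log d}{n}} + \delta$ for some $\delta \in [0, 1]$. Then with probability at least
$1 - c_1 e^{-c_2 (n\delta^2 + \log d)}$, the estimate $\hat{\theta}_{j+}$ has the following properties:

(a) It has a support $\hat{S} = \mathrm{supp}(\hat{\theta})$ that is contained within the neighborhood set $N(j)$.
(b) It satisfies the $\ell_\infty$-bound $\|\hat{\theta}_{j+} - \theta^*_{j+}\|_\infty \le \frac{c_3}{c_{\min}} \sqrt{m}\, \lambda_n$.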


As with our earlier results on the neighborhood and graphical Lasso, part (a) guarantees
that the method leads to no false inclusions. On the other hand, the $\ell_\infty$-bound in part (b)
ensures that the method picks up all significant variables. The proof of Theorem 11.15 is based
on the same type of primal–dual witness construction used in the proof of Theorem 11.12.
See the bibliographic section for further details.


11.4 Graphs with corrupted or hidden variables 


Thus far, we have assumed that the samples $\{x_i\}_{i=1}^n$ are observed perfectly. This idealized
setting can be violated in a number of ways. The samples may be corrupted by some type
of measurement noise, or certain entries may be missing. In the most extreme case, some
subset of the variables are never observed, and so are known as hidden or latent variables.
In this section, we discuss some methods for addressing these types of problems, focusing
primarily on the Gaussian case for simplicity.


11.4.1 Gaussian graph estimation with corrupted data 


Let us begin our exploration with the case of corrupted data. Letting $X \in \mathbb{R}^{n \times d}$ denote the data
matrix corresponding to the original samples, suppose that we instead observe a corrupted
version $Z$. In the simplest case, we might observe $Z = X + V$, where the matrix $V$ represents
some type of measurement error. A naive approach would be simply to apply a standard
Gaussian graph estimator to the observed data, but, as we will see, doing so typically leads
to inconsistent estimates.

Correcting the Gaussian graphical Lasso


Consider the graphical Lasso (11.10), which is usually based on the sample covariance matrix
$\hat{\Sigma}_x = \frac{1}{n} X^{\mathrm{T}} X = \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathrm{T}}$ of the raw samples. The naive approach would be instead to
solve the convex program

$$\hat{\Theta}_{\mathrm{NAI}} = \arg\min_{\Theta \in \mathcal{S}^{d \times d}} \Big\{ \langle\!\langle \Theta, \hat{\Sigma}_z \rangle\!\rangle - \log\det\Theta + \lambda_n |||\Theta|||_{1,\mathrm{off}} \Big\}, \tag{11.41}$$

where $\hat{\Sigma}_z = \frac{1}{n} Z^{\mathrm{T}} Z = \frac{1}{n} \sum_{i=1}^n z_i z_i^{\mathrm{T}}$ is now the sample covariance based on the observed data
matrix $Z$. However, as we explore in Exercise 11.8, the addition of noise does not preserve
Markov properties, so that (at least in general) the estimate $\hat{\Theta}_{\mathrm{NAI}}$ will not lead to consistent
estimates of either the edge set or the underlying precision matrix $\Theta^*$. In order to obtain a
consistent estimator, we need to replace $\hat{\Sigma}_z$ with an unbiased estimator of $\mathrm{cov}(x)$ based on
the observed data matrix $Z$. In order to develop intuition, let us explore a few examples.


Example 11.16 (Unbiased covariance estimate for additive corruptions) In the additive
noise setting ($Z = X + V$), suppose that each row $v_i$ of the noise matrix $V$ is drawn i.i.d. from
a zero-mean distribution, say with covariance $\Sigma_v$. In this case, a natural estimate of
$\Sigma_x := \mathrm{cov}(x)$ is given by

$$\hat{\Gamma} := \frac{1}{n} Z^{\mathrm{T}} Z - \Sigma_v. \tag{11.42}$$

As long as the noise matrix $V$ is independent of $X$, then $\hat{\Gamma}$ is an unbiased estimate of $\Sigma_x$.
Moreover, as we explore in Exercise 11.12, when both $X$ and $V$ have sub-Gaussian rows,
then a deviation condition of the form $\|\hat{\Gamma} - \Sigma_x\|_{\max} \lesssim \sqrt{\frac{\log d}{n}}$ holds with high probability. ♣


Example 11.17 (Missing data) In other settings, some entries of the data matrix $X$ might
be missing, with the remaining entries observed. In the simplest model of missing data,
known as missing completely at random, entry $(i, j)$ of the data matrix is missing with some
probability $\nu \in [0, 1)$. Based on the observed matrix $Z \in \mathbb{R}^{n \times d}$, we can construct a new matrix
$\tilde{Z} \in \mathbb{R}^{n \times d}$ with entries

$$\tilde{Z}_{ij} = \begin{cases} \dfrac{Z_{ij}}{1 - \nu} & \text{if entry } (i, j) \text{ is observed,} \\ 0 & \text{otherwise.} \end{cases}$$

With this choice, it can be verified that

$$\hat{\Gamma} = \frac{1}{n} \tilde{Z}^{\mathrm{T}} \tilde{Z} - \nu\, \mathrm{diag}\Big( \frac{\tilde{Z}^{\mathrm{T}} \tilde{Z}}{n} \Big) \tag{11.43}$$

is an unbiased estimate of the covariance matrix $\Sigma_x = \mathrm{cov}(x)$, and moreover, under suitable
tail conditions, it also satisfies the deviation condition $\|\hat{\Gamma} - \Sigma_x\|_{\max} \lesssim \sqrt{\frac{\log d}{n}}$ with high
probability. See Exercise 11.13 for more details. ♣
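
The two corrections in Examples 11.16 and 11.17 are easy to check by simulation. The sketch below implements the estimates (11.42) and (11.43) in plain Python and compares them with the naive Gram matrix on synthetic data. The helper names, the two-dimensional covariance, the noise level and the missing probability are all illustrative assumptions.

```python
import random

def gram(Z):
    """Return (1/n) Z^T Z for a matrix given as a list of n rows."""
    n, d = len(Z), len(Z[0])
    return [[sum(r[j] * r[k] for r in Z) / n for k in range(d)] for j in range(d)]

def corrected_cov_additive(Z, sigma_v):
    """Estimate (11.42): subtract the known noise covariance from (1/n) Z^T Z."""
    G = gram(Z)
    d = len(G)
    return [[G[j][k] - sigma_v[j][k] for k in range(d)] for j in range(d)]

def corrected_cov_missing(Z, observed, nu):
    """Estimate (11.43): rescale observed entries by 1/(1 - nu), zero-fill the
    rest, then shrink the diagonal of the Gram matrix by the factor (1 - nu)."""
    Zt = [[z / (1.0 - nu) if seen else 0.0 for z, seen in zip(z_row, o_row)]
          for z_row, o_row in zip(Z, observed)]
    G = gram(Zt)
    d = len(G)
    return [[(1.0 - nu) * G[j][k] if j == k else G[j][k] for k in range(d)]
            for j in range(d)]

def simulate(n=20000, seed=0, noise_sd=0.5, nu=0.3):
    """Draw x with covariance [[1, .5], [.5, 1]] and corrupt it in both ways."""
    rng = random.Random(seed)
    X, Z_add, mask = [], [], []
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = [g1, 0.5 * g1 + (0.75 ** 0.5) * g2]
        X.append(x)
        Z_add.append([xi + rng.gauss(0, noise_sd) for xi in x])
        mask.append([rng.random() >= nu for _ in x])  # True means "observed"
    sigma_v = [[noise_sd ** 2, 0.0], [0.0, noise_sd ** 2]]
    return X, Z_add, sigma_v, mask

X, Z_add, sigma_v, mask = simulate()
print("naive    :", gram(Z_add))
print("additive :", corrected_cov_additive(Z_add, sigma_v))
print("missing  :", corrected_cov_missing(X, mask, nu=0.3))
```

The naive Gram matrix overestimates the diagonal by roughly the noise variance, whereas both corrected estimates should concentrate around the true covariance as $n$ grows.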


More generally, any unbiased estimate $\hat{\Gamma}$ of $\Sigma_x$ defines a form of the corrected graphical
Lasso estimator

$$\tilde{\Theta} = \arg\min_{\Theta \in \mathcal{S}^{d \times d}} \Big\{ \langle\!\langle \Theta, \hat{\Gamma} \rangle\!\rangle - \log\det\Theta + \lambda_n |||\Theta|||_{1,\mathrm{off}} \Big\}. \tag{11.44}$$

As with the usual graphical Lasso, this is a strictly convex program, so that the solution
(when it exists) must be unique. However, depending on the nature of the covariance
estimate $\hat{\Gamma}$, it need not be the case that the program (11.44) has any solution at all! In this
case, equation (11.44) is nonsensical, since it presumes the existence of an optimal solution.
However, in Exercise 11.9, we show that as long as $\lambda_n > \|\hat{\Gamma} - \Sigma_x\|_{\max}$, then this
optimization problem has a unique optimum that is achieved, so that the estimator is meaningfully
defined. Moreover, by inspecting the proofs of the claims in Section 11.2.1, it can be seen
that the estimator $\tilde{\Theta}$ obeys similar Frobenius norm and edge selection bounds as the usual
graphical Lasso. Essentially, the only differences lie in the techniques used to bound the
deviation $\|\hat{\Gamma} - \Sigma_x\|_{\max}$.


Correcting neighborhood regression 


We now describe how the method of neighborhood regression can be corrected to deal
with corrupted or missing data. Here the underlying optimization problem is typically
non-convex, so that the analysis of the estimator becomes more interesting than that of the
corrected graphical Lasso.

As previously described in Section 11.2.2, the neighborhood regression approach involves
solving a linear regression problem, in which the observation vector $X_j \in \mathbb{R}^n$ at a given node
$j$ plays the role of the response variable, and the remaining $(d - 1)$ variables play the role
of the predictors. Throughout this section, we use $X$ to denote the $n \times (d - 1)$ matrix with
$\{X_k,\ k \in V \setminus \{j\}\}$ as its columns, and we use $y = X_j$ to denote the response vector. With this
notation, we have an instance of a corrupted linear regression model, namely

$$y = X\theta^* + w \quad \text{and} \quad Z \sim \mathbb{Q}(\cdot \mid X), \tag{11.45}$$

where the conditional probability distribution $\mathbb{Q}$ varies according to the nature of the
corruption. In application to graphical models, the response vector $y$ might also be further
corrupted, but this case can often be reduced to an instance of the previous model. For instance,
if some entries of $y = X_j$ are missing, then we can simply discard those data points in
performing the neighborhood regression at node $j$, or if $y$ is subject to further noise, it can be
incorporated into the model.

As before, the naive approach would be simply to solve a least-squares problem involving
the cost function $\frac{1}{2n}\|y - Z\theta\|_2^2$. As we explore in Exercise 11.10, doing so will lead to an
inconsistent estimate of the neighborhood regression vector $\theta^*$. However, as with the graphical
Lasso, the least-squares estimator can also be corrected. What types of quantities need to be
"corrected" in order to obtain a consistent form of linear regression? Consider the following
population-level objective function

$$\bar{\mathcal{L}}(\theta) = \tfrac{1}{2} \theta^{\mathrm{T}} \Gamma \theta - \langle \theta, \gamma \rangle, \tag{11.46}$$

where $\Gamma := \mathrm{cov}(x)$ and $\gamma := \mathrm{cov}(x, y)$. By construction, the true regression vector is the
unique global minimizer of $\bar{\mathcal{L}}$. Thus, a natural strategy is to solve a penalized regression
problem in which the pair $(\gamma, \Gamma)$ are replaced by data-dependent estimates $(\hat{\gamma}, \hat{\Gamma})$. Doing so
leads to the empirical objective function

$$\mathcal{L}_n(\theta) = \tfrac{1}{2} \theta^{\mathrm{T}} \hat{\Gamma} \theta - \langle \theta, \hat{\gamma} \rangle. \tag{11.47}$$

To be clear, the estimates $(\hat{\gamma}, \hat{\Gamma})$ must be based on the observed data $(y, Z)$. In Examples 11.16
and 11.17, we described suitable unbiased estimators $\hat{\Gamma}$ for the cases of additive corruptions
and missing entries, respectively. Exercises 11.12 and 11.13 discuss some unbiased
estimators $\hat{\gamma}$ of the cross-covariance vector $\gamma$.
Combining the ingredients, we are led to study the following corrected Lasso estimator

$$\min_{\|\theta\|_1 \le \sqrt{n / \log d}} \Big\{ \tfrac{1}{2} \theta^{\mathrm{T}} \hat{\Gamma} \theta - \langle \hat{\gamma}, \theta \rangle + \lambda_n \|\theta\|_1 \Big\}. \tag{11.48}$$

Note that it combines the objective function (11.47) with an $\ell_1$-penalty, as well as an
$\ell_1$-constraint. At first sight, including both the penalty and constraint might seem redundant,
but as shown in Exercise 11.11, this combination is actually needed when the objective
function (11.47) is non-convex. Many of the standard choices of $\hat{\Gamma}$ lead to non-convex programs:
for instance, in the high-dimensional regime ($n < d$), the previously described choices of $\hat{\Gamma}$
given in equations (11.42) and (11.43) both have negative eigenvalues, so that the associated
optimization problem is non-convex.
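
The appearance of negative eigenvalues can be seen in the smallest possible case. When $n < d$, any vector $v$ with $Zv = 0$ satisfies $v^{\mathrm{T}} \hat{\Gamma} v = -v^{\mathrm{T}} \Sigma_v v < 0$ for the additive-noise estimate (11.42). The sketch below checks this with $n = 1$ and $d = 2$; the helper names and the numbers are illustrative.

```python
def corrected_gram(Z, sigma_v):
    """Return (1/n) Z^T Z - Sigma_v, the additive-noise estimate (11.42)."""
    n, d = len(Z), len(Z[0])
    return [[sum(r[j] * r[k] for r in Z) / n - sigma_v[j][k] for k in range(d)]
            for j in range(d)]

def quadratic_form(G, v):
    """Return v^T G v."""
    return sum(v[j] * G[j][k] * v[k] for j in range(len(v)) for k in range(len(v)))

# One observation (n = 1) in d = 2 dimensions, with noise covariance 0.25 * I.
Z = [[1.0, 2.0]]
sigma_v = [[0.25, 0.0], [0.0, 0.25]]
gamma_hat = corrected_gram(Z, sigma_v)

# v is orthogonal to the single row of Z, so Z v = 0 and only -v^T Sigma_v v remains.
v = [2.0, -1.0]
print(quadratic_form(gamma_hat, v))  # -1.25
```

Since the quadratic form is negative in this direction and positive along the row of $Z$ itself, the matrix is indefinite and the objective (11.47) built from it is non-convex.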

When the optimization problem (11.48) is non-convex, it may have local optima in addition
to global optima. Since standard algorithms such as gradient descent are only guaranteed
to converge to local optima, it is desirable to have theory that applies to them. More precisely,
a local optimum for the program (11.48) is any vector $\tilde{\theta} \in \mathbb{R}^d$ such that

$$\langle \nabla \mathcal{L}_n(\tilde{\theta}),\, \theta - \tilde{\theta} \rangle \ge 0 \quad \text{for all } \theta \text{ such that } \|\theta\|_1 \le \sqrt{\tfrac{n}{\log d}}. \tag{11.49}$$

When $\tilde{\theta}$ belongs to the interior of the constraint set (that is, when it satisfies the inequality
$\|\tilde{\theta}\|_1 < \sqrt{n / \log d}$ strictly), then this condition reduces to the usual zero-gradient condition
$\nabla \mathcal{L}_n(\tilde{\theta}) = 0$. Thus, our specification includes local minima, local maxima and saddle
points.
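
As an illustration of how such an optimum might be computed, the sketch below runs composite gradient descent on the program (11.48): a gradient step on the quadratic part, then soft-thresholding for the $\ell_1$-penalty, then Euclidean projection onto the $\ell_1$-ball. This solver is one standard choice and is not prescribed by the text. The function names, the simulated data, the step size and the value of $\lambda_n$ are assumptions, and the cross-covariance estimate $\hat{\gamma} = Z^{\mathrm{T}} y / n$ is the natural unbiased choice under additive noise independent of $(X, w)$. For ease of checking, the toy problem has $n > d$, so the objective happens to be convex; the same update applies when $n < d$, where only convergence to a local optimum can be expected.

```python
import math
import random

def soft_threshold_vec(v, tau):
    """Entrywise soft-thresholding at level tau."""
    return [math.copysign(max(abs(x) - tau, 0.0), x) for x in v]

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius}, radius > 0."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    u = sorted((abs(x) for x in v), reverse=True)
    cumulative, mu = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        if u_k - candidate > 0:  # the last k passing this test sets the threshold
            mu = candidate
    return soft_threshold_vec(v, mu)

def corrected_lasso(gamma_hat, Gamma_hat, lam, radius, step, iters=500):
    """Composite gradient descent for the program (11.48), started from zero."""
    d = len(gamma_hat)
    theta = [0.0] * d
    for _ in range(iters):
        grad = [sum(Gamma_hat[j][k] * theta[k] for k in range(d)) - gamma_hat[j]
                for j in range(d)]
        shifted = [theta[j] - step * grad[j] for j in range(d)]
        theta = project_l1_ball(soft_threshold_vec(shifted, step * lam), radius)
    return theta

def simulate(n, d, seed=0, noise_sd=0.5, w_sd=0.5):
    """Additively corrupted regression data with theta* = (1, 0, ..., 0)."""
    rng = random.Random(seed)
    theta_star = [1.0] + [0.0] * (d - 1)
    Z, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y.append(sum(t * xi for t, xi in zip(theta_star, x)) + rng.gauss(0, w_sd))
        Z.append([xi + rng.gauss(0, noise_sd) for xi in x])
    # Corrected covariance (11.42) with Sigma_v = noise_sd^2 * I.
    Gamma_hat = [[sum(r[j] * r[k] for r in Z) / n - (noise_sd ** 2 if j == k else 0.0)
                  for k in range(d)] for j in range(d)]
    gamma_hat = [sum(r[j] * yi for r, yi in zip(Z, y)) / n for j in range(d)]
    return theta_star, gamma_hat, Gamma_hat

n, d = 2000, 5
theta_star, gamma_hat, Gamma_hat = simulate(n, d)
theta = corrected_lasso(gamma_hat, Gamma_hat, lam=0.1,
                        radius=math.sqrt(n / math.log(d)), step=0.5)
print([round(t, 2) for t in theta])
```

Soft-thresholding followed by projection is the exact minimizer of the penalized and constrained quadratic subproblem at each step, because projecting onto the $\ell_1$-ball is itself a soft-thresholding operation and the two thresholds add.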

We now establish an interesting property of the corrected Lasso (11.48): under suitable
conditions (ones that still permit non-convexity), any local optimum is relatively close to
the true regression vector. As in our analysis of the ordinary Lasso from Chapter 7, we
impose a restricted eigenvalue (RE) condition on the covariance estimate $\hat{\Gamma}$: more precisely,
we assume that there exists a constant $\kappa > 0$ such that

$$\langle \Delta, \hat{\Gamma} \Delta \rangle \ge \kappa \|\Delta\|_2^2 - c_0 \frac{\log d}{n} \|\Delta\|_1^2 \quad \text{for all } \Delta \in \mathbb{R}^d. \tag{11.50}$$

Interestingly, such an RE condition can hold for matrices $\hat{\Gamma}$ that are indefinite (with both
positive and negative eigenvalues), including our estimators for additive corruptions and
missing data from Examples 11.16 and 11.17. See Exercises 11.12 and 11.13, respectively,
for further details on these two cases.

Moreover, we assume that the minimizer $\theta^*$ of the population objective (11.46) has
sparsity $s$ and $\ell_2$-norm at most one, and that the sample size $n$ is lower bounded as $n \ge s \log d$.
These assumptions ensure that $\|\theta^*\|_1 \le \sqrt{s} \le \sqrt{\frac{n}{\log d}}$, so that $\theta^*$ is feasible for the non-convex
Lasso (11.48).


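Proposition 11.18 Under the RE condition (11.50), suppose that the pair $(\hat{\gamma}, \hat{\Gamma})$ satisfy
the deviation condition

$$\|\hat{\Gamma} \theta^* - \hat{\gamma}\|_{\max} \le \varphi(\mathbb{Q}, \sigma_w) \sqrt{\frac{\log d}{n}} \tag{11.51}$$

for a pre-factor $\varphi(\mathbb{Q}, \sigma_w)$ depending on the conditional distribution $\mathbb{Q}$ and noise
standard deviation $\sigma_w$. Then for any regularization parameter $\lambda_n \ge 2\big(2c_0 + \varphi(\mathbb{Q}, \sigma_w)\big) \sqrt{\frac{\log d}{n}}$,
any local optimum $\tilde{\theta}$ to the program (11.48) satisfies the bound

$$\|\tilde{\theta} - \theta^*\|_2 \le \frac{2}{\kappa} \sqrt{s}\, \lambda_n. \tag{11.52}$$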


In order to gain intuition for the constraint (11.51), observe that the optimality of $\theta^*$ for
the population-level objective (11.46) implies that $\nabla \bar{\mathcal{L}}(\theta^*) = \Gamma \theta^* - \gamma = 0$. Consequently,
condition (11.51) is the sample-based and approximate equivalent of this optimality condition.
Moreover, under suitable tail conditions, it is satisfied with high probability by our
previous choices of $(\hat{\gamma}, \hat{\Gamma})$ for additively corrupted or missing data. Again, see Exercises 11.12
and 11.13 for further details.


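Proof We prove this result in the special case when the optimum occurs in the interior of
the set $\|\theta\|_1 \le \sqrt{\frac{n}{\log d}}$. (See the bibliographic section for references to the general result.) In
this case, any local optimum $\tilde{\theta}$ must satisfy the condition $\nabla \mathcal{L}_n(\tilde{\theta}) + \lambda_n \hat{z} = 0$, where $\hat{z}$ belongs
to the subdifferential of the $\ell_1$-norm at $\tilde{\theta}$. Define the error vector $\hat{\Delta} := \tilde{\theta} - \theta^*$. Adding and
subtracting terms and then taking inner products with $\hat{\Delta}$ yields the inequality

$$\begin{aligned}
\langle \hat{\Delta},\, \nabla \mathcal{L}_n(\theta^* + \hat{\Delta}) - \nabla \mathcal{L}_n(\theta^*) \rangle
&\le |\langle \hat{\Delta}, \nabla \mathcal{L}_n(\theta^*) \rangle| - \lambda_n \langle \hat{z}, \hat{\Delta} \rangle \\
&\le \|\hat{\Delta}\|_1 \|\nabla \mathcal{L}_n(\theta^*)\|_\infty + \lambda_n \big\{ \|\theta^*\|_1 - \|\tilde{\theta}\|_1 \big\},
\end{aligned}$$

where we have used the facts that $\langle \hat{z}, \tilde{\theta} \rangle = \|\tilde{\theta}\|_1$ and $\langle \hat{z}, \theta^* \rangle \le \|\theta^*\|_1$. From the proof of
Theorem 7.8, since the vector $\theta^*$ is $S$-sparse, we have

$$\|\theta^*\|_1 - \|\tilde{\theta}\|_1 \le \|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1. \tag{11.53}$$

Since $\nabla \mathcal{L}_n(\theta) = \hat{\Gamma} \theta - \hat{\gamma}$, the deviation condition (11.51) is equivalent to the bound

$$\|\nabla \mathcal{L}_n(\theta^*)\|_\infty \le \varphi(\mathbb{Q}, \sigma_w) \sqrt{\frac{\log d}{n}},$$

which is less than $\lambda_n / 2$ by our choice of regularization parameter. Consequently, we have

$$\langle \hat{\Delta}, \hat{\Gamma} \hat{\Delta} \rangle \le \frac{\lambda_n}{2} \|\hat{\Delta}\|_1 + \lambda_n \big\{ \|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1 \big\} = \frac{3}{2} \lambda_n \|\hat{\Delta}_S\|_1 - \frac{1}{2} \lambda_n \|\hat{\Delta}_{S^c}\|_1. \tag{11.54}$$

Since $\theta^*$ is $s$-sparse, we have $\|\theta^*\|_1 \le \sqrt{s}\, \|\theta^*\|_2 \le \sqrt{\frac{n}{\log d}}$, where the final inequality follows
from the assumption that $n \ge s \log d$. Consequently, we have

$$\|\hat{\Delta}\|_1 \le \|\tilde{\theta}\|_1 + \|\theta^*\|_1 \le 2 \sqrt{\frac{n}{\log d}}.$$

Combined with the RE condition (11.50), we have

$$\langle \hat{\Delta}, \hat{\Gamma} \hat{\Delta} \rangle \ge \kappa \|\hat{\Delta}\|_2^2 - c_0 \frac{\log d}{n} \|\hat{\Delta}\|_1^2 \ge \kappa \|\hat{\Delta}\|_2^2 - 2 c_0 \sqrt{\frac{\log d}{n}}\, \|\hat{\Delta}\|_1.$$

Recombining with our earlier bound (11.54), we have

$$\begin{aligned}
\kappa \|\hat{\Delta}\|_2^2
&\le 2 c_0 \sqrt{\frac{\log d}{n}}\, \|\hat{\Delta}\|_1 + \frac{3}{2} \lambda_n \|\hat{\Delta}_S\|_1 - \frac{1}{2} \lambda_n \|\hat{\Delta}_{S^c}\|_1 \\
&\le \frac{1}{2} \lambda_n \|\hat{\Delta}\|_1 + \frac{3}{2} \lambda_n \|\hat{\Delta}_S\|_1 - \frac{1}{2} \lambda_n \|\hat{\Delta}_{S^c}\|_1 \\
&= 2 \lambda_n \|\hat{\Delta}_S\|_1.
\end{aligned}$$

Since $\|\hat{\Delta}_S\|_1 \le \sqrt{s}\, \|\hat{\Delta}\|_2$, the claim follows.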


11.4.2 Gaussian graph selection with hidden variables 


In certain settings, a given set of random variables might not be accurately described using
a sparse graphical model on their own, but can be when augmented with an additional set of
hidden variables. The extreme case of this phenomenon is the distinction between independence
and conditional independence: for instance, the random variables $X_1$ = Shoe size
and $X_2$ = Gray hair are likely to be dependent, since few children have gray hair. However,
it might be reasonable to model them as being conditionally independent given a third
variable, namely $X_3$ = Age.

How can we estimate a sparse graphical model when only a subset of the variables is
observed? More precisely, consider a family of $d + r$ random variables, say written as
$X := (X_1, \ldots, X_d, X_{d+1}, \ldots, X_{d+r})$, and suppose that this full vector can be modeled by a sparse
graphical model with $d + r$ vertices. Now suppose that we observe only the subvector
$X_O := (X_1, \ldots, X_d)$, with the other components $X_H := (X_{d+1}, \ldots, X_{d+r})$ staying hidden. Given
this partial information, our goal is to recover useful information about the underlying graph.

In the Gaussian case, this problem has an attractive matrix-theoretic formulation. In
particular, the observed samples of $X_O$ give us information about the covariance matrix $\Sigma^*_{OO}$.
On the other hand, since we have assumed that the full vector is Markov with respect to a
sparse graph, the Hammersley–Clifford theorem implies that the inverse covariance matrix
$\Theta^{\diamond}$ of the full vector $X = (X_O, X_H)$ is sparse. This $(d + r)$-dimensional matrix can be written
in the block-partitioned form

$$\Theta^{\diamond} = \begin{bmatrix} \Theta^{\diamond}_{OO} & \Theta^{\diamond}_{OH} \\ \Theta^{\diamond}_{HO} & \Theta^{\diamond}_{HH} \end{bmatrix}. \tag{11.55}$$

The block-matrix inversion formula (see Exercise 11.3) ensures that the inverse of the
$d$-dimensional covariance matrix $\Sigma^*_{OO}$ has the decomposition

$$(\Sigma^*_{OO})^{-1} = \underbrace{\Theta^{\diamond}_{OO}}_{\Gamma^*} - \underbrace{\Theta^{\diamond}_{OH} (\Theta^{\diamond}_{HH})^{-1} \Theta^{\diamond}_{HO}}_{\Lambda^*}. \tag{11.56}$$

By our modeling assumptions, the matrix $\Gamma^* := \Theta^{\diamond}_{OO}$ is sparse, whereas the second
component $\Lambda^* := \Theta^{\diamond}_{OH} (\Theta^{\diamond}_{HH})^{-1} \Theta^{\diamond}_{HO}$ has rank at most $\min\{r, d\}$. Consequently, it has low rank
whenever the number of hidden variables $r$ is substantially less than the number of observed
variables $d$. In this way, the addition of hidden variables leads to an inverse covariance matrix
that can be decomposed as the sum of a sparse and a low-rank matrix.
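
The decomposition (11.56) can be verified numerically on a three-node example with two observed variables and one hidden variable. In the sketch below, the precision matrix `theta_full` and the helper `inverse` are illustrative assumptions; the two observed variables share no edge, yet marginalizing out the hidden node produces a non-zero off-diagonal entry in the marginal precision matrix.

```python
def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

# Full precision matrix on (X1, X2, X3): X3 is hidden, and X1, X2 share no edge.
theta_full = [[2.0, 0.0, 0.5],
              [0.0, 2.0, 0.5],
              [0.5, 0.5, 1.0]]
d = 2  # number of observed variables

# Left-hand side of (11.56): invert the observed block of the full covariance.
sigma_full = inverse(theta_full)
sigma_oo = [row[:d] for row in sigma_full[:d]]
marginal_precision = inverse(sigma_oo)

# Right-hand side of (11.56): sparse block minus a rank-one correction.
gamma_star = [row[:d] for row in theta_full[:d]]
theta_oh = [theta_full[j][d] for j in range(d)]
lambda_star = [[theta_oh[j] * theta_oh[k] / theta_full[d][d] for k in range(d)]
               for j in range(d)]
schur = [[gamma_star[j][k] - lambda_star[j][k] for k in range(d)] for j in range(d)]

print(marginal_precision)
print(schur)
```

Both printed matrices should agree up to rounding error, with the rank-one term $\Lambda^*$ accounting entirely for the induced dependence between the observed variables.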

Now suppose that we are given $n$ i.i.d. samples $x_i \in \mathbb{R}^d$ from a zero-mean Gaussian
with covariance $\Sigma^*_{OO}$. In the absence of any sparsity in the low-rank component, we require
$n > d$ samples to obtain any sort of reasonable estimate (recall our results on covariance
estimation from Chapter 6). When $n > d$, then the sample covariance matrix $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathrm{T}}$
will be invertible with high probability, and hence setting $Y := (\hat{\Sigma})^{-1}$, we can consider an
observation model of the form

$$Y = \Gamma^* - \Lambda^* + W. \tag{11.57}$$

Here $W \in \mathbb{R}^{d \times d}$ is a stochastic noise matrix, corresponding to the difference between the
inverses of the population and sample covariances. This observation model (11.57) is a
particular form of additive matrix decomposition, as previously discussed in Section 10.7.

How can we estimate the components of this decomposition? In this section, we analyze a very
simple two-step estimator, based on first computing a hard-thresholded version of the inverse
sample covariance $Y$ as an estimate of $\Gamma^*$, and secondly, taking the residual matrix as an
estimate of $\Lambda^*$. In particular, for a threshold $\nu_n > 0$ to be chosen, we define the estimates

$$\hat{\Gamma} := T_{\nu_n}\big((\hat{\Sigma})^{-1}\big) \quad \text{and} \quad \hat{\Lambda} := \hat{\Gamma} - (\hat{\Sigma})^{-1}. \tag{11.58}$$

Here the hard-thresholding operator is given by $T_{\nu_n}(v) = v\, \mathbb{I}[|v| > \nu_n]$, applied elementwise.
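
The two-step estimator (11.58) amounts to a few lines of code once the inverse sample covariance $Y$ is available. The sketch below applies it to a noiseless toy input, meaning $W = 0$ in the model (11.57). The function names and the matrices `gamma_star` and `lambda_star` are illustrative assumptions.

```python
def hard_threshold(Y, nu):
    """Apply T_nu(v) = v * 1[|v| > nu] to every entry of the matrix Y."""
    return [[v if abs(v) > nu else 0.0 for v in row] for row in Y]

def two_step_estimate(Y, nu):
    """Return (Gamma_hat, Lambda_hat) from (11.58) for an inverse sample covariance Y."""
    gamma_hat = hard_threshold(Y, nu)
    lambda_hat = [[g - y for g, y in zip(g_row, y_row)]
                  for g_row, y_row in zip(gamma_hat, Y)]
    return gamma_hat, lambda_hat

# Noiseless toy input Y = Gamma* - Lambda*.
gamma_star = [[2.0, 0.5, 0.0], [0.5, 2.0, 0.0], [0.0, 0.0, 2.0]]
lambda_star = [[0.1] * 3 for _ in range(3)]  # rank one, every entry equal to 0.1
Y = [[g - l for g, l in zip(g_row, l_row)]
     for g_row, l_row in zip(gamma_star, lambda_star)]

gamma_hat, lambda_hat = two_step_estimate(Y, nu=0.2)
print(gamma_hat)
print(lambda_hat)
```

With the threshold set above the largest entry of $\Lambda^*$, the support of $\hat{\Gamma}$ matches that of $\Gamma^*$ exactly. Note that $\hat{\Lambda}$ vanishes wherever an entry of $Y$ survives thresholding, so its entrywise error there is of the order of the entries of $\Lambda^*$.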

As discussed in Chapter 10, sparse-plus-low-rank decompositions are unidentifiable
unless constraints are imposed on the pair $(\Gamma^*, \Lambda^*)$. As with our earlier study of matrix
decompositions in Section 10.7, we assume here that the low-rank component satisfies a
"spikiness" constraint, meaning that its elementwise max-norm is bounded as $\|\Lambda^*\|_{\max} \le \frac{\alpha}{d}$. In
addition, we assume that the matrix square root of the true precision matrix $\Theta^* = \Gamma^* - \Lambda^*$
has a bounded $\ell_\infty$-operator norm, meaning that

$$|||\sqrt{\Theta^*}|||_\infty = \max_{j = 1, \ldots, d} \sum_{k=1}^d \big| [\sqrt{\Theta^*}]_{jk} \big| \le \sqrt{M}. \tag{11.59}$$

In terms of the parameters $(\alpha, M)$, we then choose the threshold parameter $\nu_n$ in our
estimates (11.58) as

$$\nu_n := M \Big( 4 \sqrt{\frac{\log d}{n}} + \delta \Big) + \frac{\alpha}{d} \quad \text{for some } \delta \in [0, 1]. \tag{11.60}$$

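Proposition 11.19 Consider a precision matrix $\Theta^*$ that can be decomposed as the
difference $\Gamma^* - \Lambda^*$, where $\Gamma^*$ has at most $s$ non-zero entries per row, and $\Lambda^*$ is $\alpha$-spiky.
Given $n > d$ i.i.d. samples from the $N(0, (\Theta^*)^{-1})$ distribution and any $\delta \in (0, 1]$, the
estimates $(\hat{\Gamma}, \hat{\Lambda})$ satisfy the bounds

$$\|\hat{\Gamma} - \Gamma^*\|_{\max} \le 2M \Big( 4 \sqrt{\frac{\log d}{n}} + \delta \Big) + \frac{2\alpha}{d} \tag{11.61a}$$

and

$$|||\hat{\Lambda} - \Lambda^*|||_2 \le M \Big( 2 \sqrt{\frac{d}{n}} + \delta \Big) + s\, \|\hat{\Gamma} - \Gamma^*\|_{\max} \tag{11.61b}$$

with probability at least $1 - c_1 e^{-c_2 n \delta^2}$.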


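Proof We first prove that the inverse sample covariance matrix $Y := (\hat{\Sigma})^{-1}$ is itself a good
estimate of $\Theta^*$, in the sense that, for all $\delta \in (0, 1]$,

$$|||Y - \Theta^*|||_2 \le M \Big( 2 \sqrt{\frac{d}{n}} + \delta \Big) \tag{11.62a}$$

and

$$\|Y - \Theta^*\|_{\max} \le M \Big( 4 \sqrt{\frac{\log d}{n}} + \delta \Big) \tag{11.62b}$$

with probability at least $1 - c_1 e^{-c_2 n \delta^2}$.
To prove the first bound (11.62a), we note that

$$(\hat{\Sigma})^{-1} - \Theta^* = \sqrt{\Theta^*} \big\{ n^{-1} V^{\mathrm{T}} V - I_d \big\} \sqrt{\Theta^*}, \tag{11.63}$$

where $V \in \mathbb{R}^{n \times d}$ is a standard Gaussian random matrix. Consequently, by sub-multiplicativity
of the operator norm, we have

$$\begin{aligned}
|||(\hat{\Sigma})^{-1} - \Theta^*|||_2
&\le |||\sqrt{\Theta^*}|||_2\, |||n^{-1} V^{\mathrm{T}} V - I_d|||_2\, |||\sqrt{\Theta^*}|||_2
= |||\Theta^*|||_2\, |||n^{-1} V^{\mathrm{T}} V - I_d|||_2 \\
&\le |||\Theta^*|||_2 \Big\{ 2 \sqrt{\frac{d}{n}} + \delta \Big\},
\end{aligned}$$

where the final inequality holds with probability at least $1 - c_1 e^{-c_2 n \delta^2}$, via an application of
Theorem 6.1. To complete the proof, we note that

$$|||\Theta^*|||_2 \le |||\Theta^*|||_\infty \le \big( |||\sqrt{\Theta^*}|||_\infty \big)^2 \le M,$$

from which the bound (11.62a) follows.
Turning to the bound (11.62b), using the decomposition (11.63) and introducing the
shorthand $\tilde{\Sigma} = \frac{V^{\mathrm{T}} V}{n} - I_d$, we have

$$\begin{aligned}
\|(\hat{\Sigma})^{-1} - \Theta^*\|_{\max}
&= \max_{j,k = 1, \ldots, d} \big| e_j^{\mathrm{T}} \sqrt{\Theta^*}\, \tilde{\Sigma}\, \sqrt{\Theta^*} e_k \big| \\
&\le \max_{j,k = 1, \ldots, d} \|\sqrt{\Theta^*} e_j\|_1\, \|\tilde{\Sigma} \sqrt{\Theta^*} e_k\|_\infty \\
&\le \|\tilde{\Sigma}\|_{\max} \max_{j = 1, \ldots, d} \|\sqrt{\Theta^*} e_j\|_1^2.
\end{aligned}$$

Now observe that

$$\max_{j = 1, \ldots, d} \|\sqrt{\Theta^*} e_j\|_1 \le \max_{\|u\|_1 = 1} \|\sqrt{\Theta^*} u\|_1 = \max_{\ell = 1, \ldots, d} \sum_{k=1}^d \big| [\sqrt{\Theta^*}]_{k\ell} \big| = |||\sqrt{\Theta^*}|||_\infty,$$

where the final equality uses the symmetry of $\sqrt{\Theta^*}$. Putting together the pieces yields that
$\|(\hat{\Sigma})^{-1} - \Theta^*\|_{\max} \le M \|\tilde{\Sigma}\|_{\max}$. Since $\tilde{\Sigma} = V^{\mathrm{T}} V / n - I_d$, where $V \in \mathbb{R}^{n \times d}$ is a matrix of i.i.d.
standard normal variates, we have $\|\tilde{\Sigma}\|_{\max} \le 4 \sqrt{\frac{\log d}{n}} + \delta$ with probability at least $1 - c_1 e^{-c_2 n \delta^2}$ for
all $\delta \in [0, 1]$. This completes the proof of the bound (11.62b).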


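Next we establish bounds on the estimates $(\hat{\Gamma}, \hat{\Lambda})$ previously defined in equation (11.58).
Recalling our shorthand $Y = (\hat{\Sigma})^{-1}$, by the definition of $\hat{\Gamma}$ and the triangle inequality, we
have

$$\begin{aligned}
\|\hat{\Gamma} - \Gamma^*\|_{\max}
&\le \|Y - \Theta^*\|_{\max} + \|Y - T_{\nu_n}(Y)\|_{\max} + \|\Lambda^*\|_{\max} \\
&\le M \Big( 4 \sqrt{\frac{\log d}{n}} + \delta \Big) + \nu_n + \frac{\alpha}{d} = 2 \nu_n,
\end{aligned}$$

thereby establishing inequality (11.61a).
Turning to the operator norm bound, the triangle inequality implies that

$$|||\hat{\Lambda} - \Lambda^*|||_2 \le |||Y - \Theta^*|||_2 + |||\hat{\Gamma} - \Gamma^*|||_2 \le M \Big( 2 \sqrt{\frac{d}{n}} + \delta \Big) + |||\hat{\Gamma} - \Gamma^*|||_2.$$

Recall that $\Gamma^*$ has at most $s$ non-zero entries per row. For any index $(j,k)$ such that $\Gamma^*_{jk} = 0$,
we have $\Theta^*_{jk} = -\Lambda^*_{jk}$, and hence

$$|Y_{jk}| \le |Y_{jk} - \Theta^*_{jk}| + |\Lambda^*_{jk}| \le M \Big( 4 \sqrt{\frac{\log d}{n}} + \delta \Big) + \frac{\alpha}{d} = \nu_n.$$

Consequently $\hat{\Gamma}_{jk} = T_{\nu_n}(Y_{jk}) = 0$ by construction. Therefore, the error matrix $\hat{\Gamma} - \Gamma^*$ has at
most $s$ non-zero entries per row, whence

$$|||\hat{\Gamma} - \Gamma^*|||_2 \le |||\hat{\Gamma} - \Gamma^*|||_\infty = \max_{j = 1, \ldots, d} \sum_{k=1}^d |\hat{\Gamma}_{jk} - \Gamma^*_{jk}| \le s\, \|\hat{\Gamma} - \Gamma^*\|_{\max}.$$

Putting together the pieces yields the claimed bound (11.61b).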


11.5 Bibliographic details and background 


Graphical models have a rich history, with parallel developments taking place in statistical 
physics (Ising, 1925; Bethe, 1935; Baxter, 1982), information and coding theory (Gallager, 
1968; Richardson and Urbanke, 2008), artificial intelligence (Pearl, 1988) and image pro- 
cessing (Geman and Geman, 1984), among other areas. See the books (Lauritzen, 1996; 
Mézard and Montanari, 2008; Wainwright and Jordan, 2008; Koller and Friedman, 2010) 
for further background. The Ising model from Example 11.4 was first proposed as a model 
for ferromagnetism in statistical physics (Ising, 1925), and has been extensively studied. The
Hammersley–Clifford theorem derives its name from the unpublished manuscript (Hammersley
and Clifford, 1971). Grimmett (1973) and Besag (1974) were the first to publish proofs
of the result; see Clifford (1990) for further discussion of its history. Lauritzen (1996) pro- 
vides discussion of how the Markov factorization equivalence can break down when the 
strict positivity condition is not satisfied. There are a number of connections between the 
classical theory of exponential families (Barndorff-Nielson, 1978; Brown, 1986) and graph- 
ical models; see the monograph (Wainwright and Jordan, 2008) for further details. 

The Gaussian graphical Lasso (11.10) has been studied by a large number of researchers 
(e.g., Friedman et al., 2007; Yuan and Lin, 2007; Banerjee et al., 2008; d’Aspremont et al.,
2008; Rothman et al., 2008; Ravikumar et al., 2011), in terms of both its statistical and 
optimization-related properties. The Frobenius norm bounds in Proposition 11.9 were first 
proved by Rothman et al. (2008). Ravikumar et al. (2011) proved the model selection results 
given in Proposition 11.10; they also analyzed the estimator for more general non-Gaussian 
distributions, and under a variety of tail conditions. There are also related analyses of Gaus- 
sian maximum likelihood using various forms of non-convex penalties (e.g., Lam and Fan, 
2009; Loh and Wainwright, 2017). Among others, Friedman et al. (2007) and d’Aspremont
et al. (2008) have developed efficient algorithms for solving the Gaussian graphical Lasso. 

Neighborhood-based methods for graph estimation have their roots in the notion of pseudo- 
likelihood, as studied in the classical work of Besag (1974; 1975; 1977). Besag (1974) dis- 
cusses various neighbor-based specifications of graphical models, including the Gaussian 
graphical model from Example 11.3, the Ising (binary) graphical model from Example 11.4, 
and the Poisson graphical model from Example 11.14. Meinshausen and Bühlmann (2006)
provided the first high-dimensional analysis of the Lasso as a method for neighborhood se- 
lection in Gaussian graphical models. Their analysis, and that of related work by Zhao and 
Yu (2006), was based on assuming that the design matrix itself satisfies the a-incoherence 
condition, whereas the result given in Theorem 11.12, adapted from Wainwright (2009b), 
imposes these conditions on the population, and then proves that the sample versions satisfy 
them with high probability. Whereas we only proved Theorem 11.12 when the maximum 
degree m is at most log d, the paper (Wainwright, 2009b) provides a proof for the general 
case. 

Meinshausen (2008) discussed the need for stronger incoherence conditions with the 
Gaussian graphical Lasso (11.10) as opposed to the neighborhood selection method; see 
also Ravikumar et al. (2011) for further comparison of these types of incoherence condi- 
tions. Other neighborhood-based methods have also been studied in the literature, including 
methods based on the Dantzig selector (Yuan, 2010) and the CLIME-based method (Cai 
et al., 2011). Exercise 11.4 works through some analysis for the CLIME estimator. 

Ravikumar et al. (2010) analyzed the $\ell_1$-regularized logistic regression method for Ising
model selection using the primal–dual witness method; Theorem 11.15 is adapted from their
work. Other authors have studied different methods for graphical model selection in discrete 
models, including various types of entropy tests, thresholding methods and greedy meth- 
ods (e.g., Netrapalli et al., 2010; Anandkumar et al., 2012; Bresler et al., 2013; Bresler, 
2014). Santhanam and Wainwright (2012) prove lower bounds on the number of samples re- 
quired for Ising model selection; combined with the improved achievability results of Bento 
and Montanari (2009), these lower bounds show that $\ell_1$-regularized logistic regression is
an order-optimal method. Rather than estimating each neighborhood separately, it is more
natural to perform a joint estimation of all neighborhoods simultaneously. One way in
which to do so is to sum all of the conditional likelihoods associated with each node, and 
then optimize the sum jointly, ensuring that all edges use the same parameter value in each 
neighborhood. The resulting procedure is equivalent to the pseudo-likelihood method (Be- 
sag, 1975, 1977). Hoefling and Tibshirani (2009) compare the relative efficiency of various 
pseudo-likelihood-type methods for graph estimation. 


The corrected least-squares cost (11.47) is a special case of a more general class of 
corrected likelihood methods (e.g., Carroll et al., 1995; Iturria et al., 1999; Xu and You, 
2007). The corrected non-convex Lasso (11.48) was proposed and analyzed by Loh and 
Wainwright (2012; 2017). A related corrected form of the Dantzig selector was analyzed 
by Rosenbaum and Tsybakov (2010). Proposition 11.18 is a special case of more general re- 
sults on non-convex M-estimators proved in the papers (Loh and Wainwright, 2015, 2017). 


The matrix decomposition approach to Gaussian graph selection with hidden variables 
was pioneered by Chandrasekaran et al. (2012b), who proposed regularizing the global 
likelihood (log-determinant function) with nuclear and $\ell_1$-norms. They provided sufficient
conditions for exact recovery of sparsity and rank using the primal–dual witness method,
previously used to analyze the standard graphical Lasso (Ravikumar et al., 2011). Ren and 
Zhou (2012) proposed more direct approaches for estimating such matrix decompositions, 
such as the simple estimator analyzed in Proposition 11.19. Agarwal et al. (2012) analyzed 
both a direct approach based on thresholding and truncated SVD, as well as regularization- 
based methods for more general problems of matrix decomposition. As with other work on 
matrix decomposition problems (Candès et al., 2011; Chandrasekaran et al., 2011),
Chandrasekaran et al. (2012b) performed their analysis under strong incoherence conditions,
essentially algebraic conditions that ensure perfect identifiability for the sparse-plus-low-rank
problem. The milder constraint, namely of bounding the maximum entry of the low-rank 
component as in Proposition 11.19, was introduced by Agarwal et al. (2012). 


In addition to the undirected graphical models discussed here, there is also a substan- 
tial literature on methods for directed graphical models; we refer the reader to the sources 
(Spirtes et al., 2000; Kalisch and Bühlmann, 2007; Bühlmann and van de Geer, 2011) and
references therein for more details. Liu et al. (2009; 2012) propose and study the non- 
paranormal family, a nonparametric generalization of the Gaussian graphical model. Such 
models are obtained from Gaussian models by applying a univariate transformation to the 
random variable at each node. The authors discuss methods for estimating such models; see 
also Xue and Zou (2012) for related results. 


11.6 Exercises 


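Exercise 11.1 (Properties of log-determinant function) Let $\mathcal{S}^{d \times d}$ denote the set of symmetric 
matrices, and $\mathcal{S}_+^{d \times d}$ denote the cone of symmetric and strictly positive definite matrices. In 
this exercise, we study properties of the (negative) log-determinant function 
$F : \mathcal{S}^{d \times d} \to \mathbb{R}$ given by 

$$F(\Theta) = \begin{cases} -\sum_{j=1}^{d} \log \gamma_j(\Theta) & \text{if } \Theta \in \mathcal{S}_+^{d \times d}, \\ +\infty & \text{otherwise,} \end{cases}$$

where $\gamma_j(\Theta) > 0$ are the eigenvalues of $\Theta$. 

(a) Show that $F$ is a strictly convex function on its domain $\mathcal{S}_+^{d \times d}$. 

(b) For $\Theta \in \mathcal{S}_+^{d \times d}$, show that $\nabla F(\Theta) = -\Theta^{-1}$. 

(c) For $\Theta \in \mathcal{S}_+^{d \times d}$, show that $\nabla^2 F(\Theta) = \Theta^{-1} \otimes \Theta^{-1}$. 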

Exercise 11.2 (Gaussian MLE) Consider the maximum likelihood estimate of the inverse 
covariance matrix $\Theta^*$ for a zero-mean Gaussian. Show that it takes the form 

$$\widehat{\Theta}_{\mathrm{MLE}} = \begin{cases} \widehat{\Sigma}^{-1} & \text{if } \widehat{\Sigma} \succ 0, \\ \text{not defined} & \text{otherwise,} \end{cases}$$

where $\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^{\mathsf T}$ is the empirical covariance matrix for a zero-mean vector. (When 
$\widehat{\Sigma}$ is rank-deficient, you need to show explicitly that there exists a sequence of matrices for 
which the likelihood diverges to infinity.) 


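Exercise 11.3 (Gaussian neighborhood regression) Let $X \in \mathbb{R}^d$ be a zero-mean jointly 
Gaussian random vector with strictly positive definite covariance matrix $\Sigma^*$. Consider the 
conditioned random variable $Z := (X_j \mid X_{\setminus \{j\}})$, where we use the shorthand $\setminus\{j\} = V \setminus \{j\}$. 

(a) Establish the validity of the decomposition (11.21). 

(b) Show that $\theta^*_j = \big(\Sigma^*_{\setminus\{j\}, \setminus\{j\}}\big)^{-1} \Sigma^*_{\setminus\{j\},\, j}$. 

(c) Show that $\theta^*_{jk} = 0$ whenever $k \notin N(j)$. 

Hint: The following elementary fact could be useful: let $A$ be an invertible matrix, given 
in the block-partitioned form 

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}.$$

Then letting $B = A^{-1}$, we have (see Horn and Johnson (1985)) 

$$B_{22} = \big(A_{22} - A_{21} (A_{11})^{-1} A_{12}\big)^{-1} \quad \text{and} \quad B_{12} = (A_{11})^{-1} A_{12} \big[A_{21} (A_{11})^{-1} A_{12} - A_{22}\big]^{-1}.$$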


Exercise 11.4 (Alternative estimator of sparse precision matrix) Consider a d-variate Gaus- 
sian random vector with zero mean, and a sparse precision matrix $\Theta^*$. In this exercise, we 
analyze the estimator 

$$\widehat{\Theta} = \arg\min_{\Theta \in \mathbb{R}^{d \times d}} \|\Theta\|_1 \quad \text{such that } \|\widehat{\Sigma} \Theta - I_d\|_{\max} \le \lambda_n, \tag{11.64}$$

where $\widehat{\Sigma}$ is the sample covariance based on $n$ i.i.d. samples. 

(a) For $j = 1, \ldots, d$, consider the linear program 

$$\widehat{\Gamma}_j \in \arg\min_{\Gamma_j \in \mathbb{R}^d} \|\Gamma_j\|_1 \quad \text{such that } \|\widehat{\Sigma} \Gamma_j - e_j\|_{\max} \le \lambda_n, \tag{11.65}$$

where $e_j \in \mathbb{R}^d$ is the $j$th canonical basis vector. Show that $\widehat{\Theta}$ is optimal for the original 
program (11.64) if and only if its $j$th column $\widehat{\Theta}_j$ is optimal for the program (11.65). 

(b) Show that $\|\widehat{\Gamma}_j\|_1 \le \|\Theta^*_j\|_1$ for each $j = 1, \ldots, d$ whenever the regularization parameter is 
lower bounded as $\lambda_n \ge |||\Theta^*|||_1 \, \|\widehat{\Sigma} - \Sigma^*\|_{\max}$. 

(c) State and prove a high-probability bound on $\|\widehat{\Sigma} - \Sigma^*\|_{\max}$. (For simplicity, you may as- 
sume that $\max_{j = \cdots}$ …) 

(d) Use the preceding parts to show that, for an appropriate choice of $\lambda_n$, there is a universal 
constant $c$ such that 

$$\|\widehat{\Theta} - \Theta^*\|_{\max} \le c \, |||\Theta^*|||_1^2 \sqrt{\frac{\log d}{n}} \tag{11.66}$$

with high probability. 


Exercise 11.5 (Special case of general neighborhood regression) Show that the general 
form of neighborhood regression (11.37) reduces to linear regression (11.22) in the Gaussian 
case. (Note: You may ignore constants, either pre-factors or additive ones, that do not depend 
on the data.) 


Exercise 11.6 (Structure of conditional distribution) Given a density of the form (11.32), 
show that the conditional likelihood of $X_j$ given $X_{\setminus\{j\}}$ depends only on 

$$\Theta_{j+} := \{\Theta_{jk},\; k \in V \setminus \{j\}\}.$$

Prove that $\Theta_{jk} = 0$ whenever $(j, k) \notin E$. 


Exercise 11.7 (Conditional distribution for Ising model) For a binary random vector 
$X \in \{-1, 1\}^d$, consider the family of distributions 

$$p_\theta(x_1, \ldots, x_d) = \exp\Big\{ \sum_{(j,k) \in E} \theta_{jk} x_j x_k - \Phi(\theta) \Big\}, \tag{11.67}$$

where $E$ is the edge set of some undirected graph $G$ on the vertices $V = \{1, 2, \ldots, d\}$. 

(a) For each edge $(j, k) \in E$, show that $\frac{\partial \Phi(\theta)}{\partial \theta_{jk}} = \mathbb{E}_\theta[X_j X_k]$. 

(b) Compute the conditional distribution of $X_j$ given the subvector of random variables 
$X_{\setminus\{j\}} := \{X_k,\; k \in V \setminus \{j\}\}$. Give an expression in terms of the logistic function 
$f(t) = \log(1 + e^t)$. 


Exercise 11.8 (Additive noise and Markov properties) Let $X = (X_1, \ldots, X_d)$ be a zero-mean 
Gaussian random vector that is Markov with respect to some graph $G$, and let $Z = X + V$, 
where $V \sim N(0, \sigma^2 I_d)$ is an independent Gaussian noise vector. Supposing that $\sigma^2 |||\Theta^*|||_2 < 1$, 
derive an expression for the inverse covariance of $Z$ in terms of powers of $\sigma^2 \Theta^*$. Interpret 
this expression in terms of weighted path lengths in the graph. 


Exercise 11.9 (Solutions for corrected graphical Lasso) In this exercise, we explore prop- 
erties of the corrected graphical Lasso from equation (11.44). 
(a) Defining $\Sigma_x := \operatorname{cov}(x)$, show that as long as $\lambda_n > \|\widehat{\Gamma} - \Sigma_x\|_{\max}$, then the corrected 
graphical Lasso (11.44) has a unique optimal solution. 

(b) Show what can go wrong when this condition is violated. (Hint: It suffices to consider a 
one-dimensional example.) 


Exercise 11.10 (Inconsistency of uncorrected Lasso) Consider the linear regression model 
$y = X\theta^* + w$, where we observe the response vector $y \in \mathbb{R}^n$ and the corrupted matrix 
$Z = X + V$. A naive estimator of $\theta^*$ is 

$$\widetilde{\theta} = \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{2n} \|y - Z\theta\|_2^2 \Big\},$$

where we regress $y$ on the corrupted matrix $Z$. Suppose that each row of $X$ is drawn i.i.d. 
from a zero-mean distribution with covariance $\Sigma$, and that each row of $V$ is drawn i.i.d. (and 
independently from $X$) from a zero-mean distribution with covariance $\sigma^2 I$. Show that $\widetilde{\theta}$ is 
inconsistent even if the sample size $n \to +\infty$ with the dimension fixed. 


Exercise 11.11 (Solutions for corrected Lasso) Show by an example in two dimensions 
that the corrected Lasso (11.48) may not achieve its global minimum if an $\ell_1$-bound of the 
form $\|\theta\|_1 \le R$ for some radius $R$ is not imposed. 


Exercise 11.12 (Corrected Lasso for additive corruptions) In this exercise, we explore 
properties of corrected linear regression in the case of additive corruptions (Example 11.16), 
under the standard model $y = X\theta^* + w$. 

(a) Assuming that $X$ and $V$ are independent, show that $\widehat{\Gamma}$ from equation (11.42) is an un- 
biased estimate of $\Sigma_x = \operatorname{cov}(x)$, and that $\widehat{\gamma} = Z^{\mathsf T} y / n$ is an unbiased estimate of $\operatorname{cov}(x, y)$. 

(b) Now suppose that in addition both $X$ and $V$ are generated with i.i.d. rows from a zero- 
mean distribution, and that each element $X_{ij}$ and $V_{ij}$ is sub-Gaussian with parameter 1, 
and that the noise vector $w$ is independent with i.i.d. $N(0, \sigma^2)$ entries. Show that there is 
a universal constant $c$ such that 

$$\|\widehat{\Gamma} \theta^* - \widehat{\gamma}\|_\infty \le c \, (\sigma + \|\theta^*\|_2) \sqrt{\frac{\log d}{n}}$$

with high probability. 

(c) In addition to the previous assumptions, suppose that $\Sigma_v = \nu I_d$ for some $\nu > 0$. Show 
that $\widehat{\Gamma}$ satisfies the RE condition (11.50) with high probability. (Hint: The result of Ex- 
ercise 7.10 may be helpful to you.) 


Exercise 11.13 (Corrected Lasso for missing data) In this exercise, we explore properties 
of corrected linear regression in the case of missing data (Example 11.17). Throughout, we 
assume that the missing entries are removed completely independently at random, and that 
X has zero-mean rows, generated in an i.i.d. fashion from a 1-sub-Gaussian distribution. 


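(a) Show that the matrix $\widehat{\Gamma}$ from equation (11.43) is an unbiased estimate of $\Sigma_x := \operatorname{cov}(x)$, 
and that the vector $\widehat{\gamma} = \widetilde{Z}^{\mathsf T} y / n$ is an unbiased estimate of $\operatorname{cov}(x, y)$. 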
(b) Assuming that the noise vector $w \in \mathbb{R}^n$ has i.i.d. $N(0, \sigma^2)$ entries, show there is a uni- 
versal constant $c$ such that 

$$\|\widehat{\Gamma} \theta^* - \widehat{\gamma}\|_\infty \le c \, (\sigma + \|\theta^*\|_2) \sqrt{\frac{\log d}{n}}$$

with high probability. 

(c) Show that $\widehat{\Gamma}$ satisfies the RE condition (11.50) with high probability. (Hint: The result 
of Exercise 7.10 may be helpful to you.) 


12 


Reproducing kernel Hilbert spaces 


Many problems in statistics—among them interpolation, regression and density estimation, 
as well as nonparametric forms of dimension reduction and testing—involve optimizing over 
function spaces. Hilbert spaces include a reasonably broad class of functions, and enjoy a 
geometric structure similar to ordinary Euclidean space. A particular class of function-based 
Hilbert spaces are those defined by reproducing kernels, and these spaces—known as repro- 
ducing kernel Hilbert spaces (RKHSs)—have attractive properties from both the compu- 
tational and statistical points of view. In this chapter, we develop the basic framework of 
RKHSs, which are then applied to different problems in later chapters, including nonpara- 
metric least-squares (Chapter 13) and density estimation (Chapter 14). 


12.1 Basics of Hilbert spaces 


Hilbert spaces are particular types of vector spaces, meaning that they are endowed with 
the operations of addition and scalar multiplication. In addition, they have an inner product 
defined in the usual way: 


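Definition 12.1 An inner product on a vector space $\mathbb{V}$ is a mapping 
$\langle \cdot, \cdot \rangle_{\mathbb{V}} : \mathbb{V} \times \mathbb{V} \to \mathbb{R}$ such that 

$$\langle f, g \rangle_{\mathbb{V}} = \langle g, f \rangle_{\mathbb{V}} \quad \text{for all } f, g \in \mathbb{V}, \tag{12.1a}$$
$$\langle f, f \rangle_{\mathbb{V}} \ge 0 \quad \text{for all } f \in \mathbb{V}, \text{ with equality iff } f = 0, \tag{12.1b}$$
$$\langle f + \alpha g, h \rangle_{\mathbb{V}} = \langle f, h \rangle_{\mathbb{V}} + \alpha \langle g, h \rangle_{\mathbb{V}} \quad \text{for all } f, g, h \in \mathbb{V} \text{ and } \alpha \in \mathbb{R}. \tag{12.1c}$$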


A vector space equipped with an inner product is known as an inner product space. Note 
that any inner product induces a norm via $\|f\|_{\mathbb{V}} := \sqrt{\langle f, f \rangle_{\mathbb{V}}}$. Given this norm, we can then 
define the usual notion of Cauchy sequence: a sequence $(f_n)_{n=1}^{\infty}$ with elements in $\mathbb{V}$ 
is Cauchy if, for all $\epsilon > 0$, there exists some integer $N(\epsilon)$ such that 

$$\|f_n - f_m\|_{\mathbb{V}} < \epsilon \quad \text{for all } n, m \ge N(\epsilon).$$

Definition 12.2 A Hilbert space $\mathbb{H}$ is an inner product space $(\langle \cdot, \cdot \rangle_{\mathbb{H}}, \mathbb{H})$ in which 
every Cauchy sequence $(f_n)_{n=1}^{\infty}$ in $\mathbb{H}$ converges to some element $f^* \in \mathbb{H}$. 

A metric space in which every Cauchy sequence $(f_n)_{n=1}^{\infty}$ converges to an element $f^*$ of the 
space is known as complete. Thus, we can summarize by saying that a Hilbert space is a 
complete inner product space. 


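Example 12.3 (Sequence space $\ell^2(\mathbb{N})$) Consider the space of square-summable real-valued 
sequences, namely 

$$\ell^2(\mathbb{N}) := \Big\{ (\theta_j)_{j=1}^{\infty} \;\Big|\; \sum_{j=1}^{\infty} \theta_j^2 < \infty \Big\}.$$

This set, when endowed with the usual inner product $\langle \theta, \gamma \rangle_{\ell^2(\mathbb{N})} = \sum_{j=1}^{\infty} \theta_j \gamma_j$, defines a clas- 
sical Hilbert space. It plays an especially important role in our discussion of eigenfunctions 
for reproducing kernel Hilbert spaces. Note that the Hilbert space $\mathbb{R}^m$, equipped with the 
usual Euclidean inner product, can be obtained as a finite-dimensional subspace of $\ell^2(\mathbb{N})$: in 
particular, the space $\mathbb{R}^m$ is isomorphic to the “slice” 

$$\big\{ \theta \in \ell^2(\mathbb{N}) \;\big|\; \theta_j = 0 \text{ for all } j \ge m + 1 \big\}. \qquad ♣$$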


Example 12.4 (The space $L^2[0,1]$) Any element of the space $L^2[0,1]$ is a function 
$f : [0,1] \to \mathbb{R}$ that is Lebesgue-integrable, and whose square satisfies the bound 
$\|f\|_{L^2[0,1]}^2 = \int_0^1 f^2(x)\,dx < \infty$. Since this norm does not distinguish between functions that differ only on 
a set of zero Lebesgue measure, we are implicitly identifying all such functions. The space 
$L^2[0,1]$ is a Hilbert space when equipped with the inner product $\langle f, g \rangle_{L^2[0,1]} = \int_0^1 f(x) g(x)\,dx$. 
When the space $L^2[0,1]$ is clear from the context, we omit the subscript in the inner product 
notation. In a certain sense, the space $L^2[0,1]$ is equivalent to the sequence space $\ell^2(\mathbb{N})$. In 
particular, let $(\phi_j)_{j=1}^{\infty}$ be any complete orthonormal basis of $L^2[0,1]$. By definition, the basis 
functions satisfy $\|\phi_j\|_{L^2[0,1]} = 1$ for all $j \in \mathbb{N}$, and $\langle \phi_i, \phi_j \rangle = 0$ for all $i \ne j$, and, moreover, 
any function $f \in L^2[0,1]$ has the representation $f = \sum_{j=1}^{\infty} a_j \phi_j$, where $a_j := \langle f, \phi_j \rangle$ is the $j$th 
basis coefficient. By Parseval’s theorem, we have 

$$\|f\|_{L^2[0,1]}^2 = \sum_{j=1}^{\infty} a_j^2,$$

so that $f \in L^2[0,1]$ if and only if the sequence $a = (a_j)_{j=1}^{\infty} \in \ell^2(\mathbb{N})$. The correspondence 
$f \leftrightarrow (a_j)_{j=1}^{\infty}$ thus defines an isomorphism between $L^2[0,1]$ and $\ell^2(\mathbb{N})$. ♣ 

All of the preceding examples are instances of separable Hilbert spaces, for which there 
is a countable dense subset. For such Hilbert spaces, we can always find a collection of func- 
tions $(\phi_j)_{j=1}^{\infty}$ that is orthonormal in the Hilbert space, meaning that $\langle \phi_i, \phi_j \rangle_{\mathbb{H}} = \delta_{ij}$ for all positive 
integers $i, j$, and such that any $f \in \mathbb{H}$ can be written in the form $f = \sum_{j=1}^{\infty} a_j \phi_j$ for some se- 
quence of coefficients $(a_j)_{j=1}^{\infty} \in \ell^2(\mathbb{N})$. Although there do exist non-separable Hilbert spaces, 
here we focus primarily on the separable case. 
The notion of a linear functional plays an important role in characterizing reproducing 
kernel Hilbert spaces. A linear functional on a Hilbert space $\mathbb{H}$ is a mapping $L : \mathbb{H} \to \mathbb{R}$ 
that is linear, meaning that $L(f + \alpha g) = L(f) + \alpha L(g)$ for all $f, g \in \mathbb{H}$ and $\alpha \in \mathbb{R}$. A linear 
functional is said to be bounded if there exists some $M < \infty$ such that $|L(f)| \le M \|f\|_{\mathbb{H}}$ for all 
$f \in \mathbb{H}$. Given any $g \in \mathbb{H}$, the mapping $f \mapsto \langle f, g \rangle_{\mathbb{H}}$ defines a linear functional. It is bounded, 
since by the Cauchy–Schwarz inequality we have $|\langle f, g \rangle_{\mathbb{H}}| \le M \|f\|_{\mathbb{H}}$ for all $f \in \mathbb{H}$, where 
$M := \|g\|_{\mathbb{H}}$. The Riesz representation theorem guarantees that every bounded linear func- 
tional arises in exactly this way. 


Theorem 12.5 (Riesz representation theorem) Let $L$ be a bounded linear functional 
on a Hilbert space $\mathbb{H}$. Then there exists a unique $g \in \mathbb{H}$ such that $L(f) = \langle f, g \rangle_{\mathbb{H}}$ for all 
$f \in \mathbb{H}$. (We refer to $g$ as the representer of the functional $L$.) 


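Proof Consider the nullspace $\mathbb{N}(L) = \{h \in \mathbb{H} \mid L(h) = 0\}$. Since $L$ is a bounded linear 
operator, the nullspace is closed (see Exercise 12.1). Moreover, as we show in Exercise 12.3, 
for any such closed subspace, we have the direct sum decomposition $\mathbb{H} = \mathbb{N}(L) \oplus [\mathbb{N}(L)]^{\perp}$, 
where $[\mathbb{N}(L)]^{\perp}$ consists of all $g \in \mathbb{H}$ such that $\langle h, g \rangle_{\mathbb{H}} = 0$ for all $h \in \mathbb{N}(L)$. If $\mathbb{N}(L) = \mathbb{H}$, 
then we take $g = 0$. Otherwise, there must exist a non-zero element $g_0 \in [\mathbb{N}(L)]^{\perp}$, and by 
rescaling appropriately, we may find some $g \in [\mathbb{N}(L)]^{\perp}$ such that $\|g\|_{\mathbb{H}}^2 = L(g) > 0$. We then 
define $h := L(f) g - L(g) f$, and note that $L(h) = 0$ so that $h \in \mathbb{N}(L)$. Consequently, we must 
have $\langle g, h \rangle_{\mathbb{H}} = 0$; expanding gives $L(f) \|g\|_{\mathbb{H}}^2 = L(g) \langle g, f \rangle_{\mathbb{H}}$, which implies that 
$L(f) = \langle g, f \rangle_{\mathbb{H}}$ as desired. As for uniqueness, suppose 
that there exist $g, g' \in \mathbb{H}$ such that $\langle g, f \rangle_{\mathbb{H}} = L(f) = \langle g', f \rangle_{\mathbb{H}}$ for all $f \in \mathbb{H}$. Rearranging 
yields $\langle g - g', f \rangle_{\mathbb{H}} = 0$ for all $f \in \mathbb{H}$, and setting $f = g - g'$ shows that $\|g - g'\|_{\mathbb{H}}^2 = 0$, and 
hence $g = g'$ as claimed. 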
We now turn to the notion of a reproducing kernel Hilbert space, or RKHS for short. These 
Hilbert spaces are particular types of function spaces—more specifically, functions f with 
domain X mapping to the real line R. There are many different but equivalent ways in which 
to define an RKHS. One way is to begin with the notion of a positive semidefinite kernel 
function, and use it to construct a Hilbert space in an explicit way. A by-product of this con- 
struction is the reproducing property of the kernel. An alternative, and somewhat more ab- 
stract, way is by restricting attention to Hilbert spaces in which the evaluation functionals— 
that is, the mappings from the Hilbert space to the real line obtained by evaluating each 
function at a given point—are bounded. These functionals are particularly relevant in sta- 
tistical settings, since many applications involve sampling a function at a subset of points 
on its domain. As our development will clarify, these two approaches are equivalent in that 
the kernel acts as the representer for the evaluation functional, in the sense of the Riesz 
representation theorem (Theorem 12.5). 
12.2.1 Positive semidefinite kernel functions 


Let us begin with the notion of a positive semidefinite kernel function. It is a natural gener- 
alization of the idea of a positive semidefinite matrix to the setting of general functions. 


Definition 12.6 (Positive semidefinite kernel function) A symmetric bivariate func- 
tion $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive semidefinite (PSD) if for all integers $n \ge 1$ and elements 
$\{x_i\}_{i=1}^{n} \subset \mathcal{X}$, the $n \times n$ matrix with elements $K_{ij} := K(x_i, x_j)$ is positive semidefinite. 


This notion is best understood via some examples. 


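Example 12.7 (Linear kernels) When $\mathcal{X} = \mathbb{R}^d$, we can define the linear kernel function 
$K(x, x') := \langle x, x' \rangle$. It is clearly a symmetric function of its arguments. In order to verify the 
positive semidefiniteness, let $\{x_i\}_{i=1}^{n}$ be an arbitrary collection of points in $\mathbb{R}^d$, and consider 
the matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = \langle x_i, x_j \rangle$. For any vector $\alpha \in \mathbb{R}^n$, we have 

$$\alpha^{\mathsf T} K \alpha = \sum_{i,j=1}^{n} \alpha_i \alpha_j \langle x_i, x_j \rangle = \Big\| \sum_{i=1}^{n} \alpha_i x_i \Big\|_2^2 \ge 0.$$

Since $n \in \mathbb{N}$, $\{x_i\}_{i=1}^{n}$ and $\alpha \in \mathbb{R}^n$ were all arbitrary, we conclude that $K$ is positive semi- 
definite. ♣ 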
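The identity in Example 12.7 can be spot-checked numerically. The following is a minimal sketch in plain Python, assuming randomly drawn points and weights; the helper names `linear_kernel`, `gram_matrix` and `quadratic_form` are illustrative choices and do not come from the text. It is a sanity check on one random instance, not a proof.

```python
import random

def linear_kernel(x, z):
    """Linear kernel K(x, z) = <x, z> on R^d."""
    return sum(xi * zi for xi, zi in zip(x, z))

def gram_matrix(kernel, points):
    """n x n matrix with entries K_ij = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

def quadratic_form(K, alpha):
    """Evaluate alpha^T K alpha for a square matrix K given as nested lists."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))

rng = random.Random(0)          # fixed seed so the check is reproducible
d, n = 3, 5                     # illustrative dimension and number of points
points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
alpha = [rng.gauss(0, 1) for _ in range(n)]

K = gram_matrix(linear_kernel, points)

# v = sum_i alpha_i x_i; the example asserts alpha^T K alpha = ||v||_2^2 >= 0.
v = [sum(alpha[i] * points[i][k] for i in range(n)) for k in range(d)]
lhs = quadratic_form(K, alpha)
rhs = sum(vk * vk for vk in v)
print(abs(lhs - rhs) < 1e-9, lhs >= 0.0)
```

The two sides agree up to floating-point rounding, which is all such a check can show; the argument in the example is what establishes the inequality for every choice of points and weights.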


Example 12.8 (Polynomial kernels) A natural generalization of the linear kernel on $\mathbb{R}^d$ is 
the homogeneous polynomial kernel $K(x, z) = (\langle x, z \rangle)^m$ of degree $m \ge 2$, also defined on $\mathbb{R}^d$. 
Let us demonstrate the positive semidefiniteness of this function in the special case $m = 2$. 
Note that we have 

$$K(x, z) = \Big( \sum_{j=1}^{d} x_j z_j \Big)^2 = \sum_{j=1}^{d} x_j^2 z_j^2 + 2 \sum_{i<j} x_i x_j (z_i z_j).$$

Setting $D = d + \binom{d}{2}$, let us define a mapping $\Phi : \mathbb{R}^d \to \mathbb{R}^D$ with entries 

$$\Phi(x) = \begin{bmatrix} x_j^2, & \text{for } j = 1, 2, \ldots, d \\ \sqrt{2}\, x_i x_j, & \text{for } i < j \end{bmatrix}, \tag{12.2}$$

corresponding to all polynomials of degree two in $(x_1, \ldots, x_d)$. With this definition, we see 
that $K$ can be expressed as a Gram matrix, namely in the form $K(x, z) = \langle \Phi(x), \Phi(z) \rangle_{\mathbb{R}^D}$. 
Following the same argument as Example 12.7, it is straightforward to verify that this Gram 
representation ensures that $K$ must be positive semidefinite. 

An extension of the homogeneous polynomial kernel is the inhomogeneous polynomial 
kernel $K(x, z) = (1 + \langle x, z \rangle)^m$, which is based on all polynomials of degree $m$ or less. We 
leave it as an exercise for the reader to check that it is also a positive semidefinite kernel 
function. ♣ 
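As a concrete illustration of the degree-two case, the sketch below builds the feature map (12.2) explicitly and compares the inner product of the features with the kernel value. It is a minimal example under stated assumptions: the dimension `d = 4`, the random inputs, and the function names `poly2_kernel` and `feature_map` are all hypothetical choices made for illustration.

```python
import itertools
import math
import random

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = <x, z>^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def feature_map(x):
    """Feature map (12.2): squares x_j^2, then sqrt(2) x_i x_j for i < j."""
    d = len(x)
    squares = [xj * xj for xj in x]
    cross = [math.sqrt(2.0) * x[i] * x[j]
             for i, j in itertools.combinations(range(d), 2)]
    return squares + cross

rng = random.Random(1)
d = 4
x = [rng.uniform(-1, 1) for _ in range(d)]
z = [rng.uniform(-1, 1) for _ in range(d)]

phi_x, phi_z = feature_map(x), feature_map(z)
D = d + math.comb(d, 2)         # embedding dimension D = d + (d choose 2)

inner = sum(a * b for a, b in zip(phi_x, phi_z))
print(len(phi_x) == D, abs(inner - poly2_kernel(x, z)) < 1e-12)
```

The kernel evaluation touches only the `d` coordinates of each input, while the explicit map produces `D` features; this gap is what the later discussion of the kernel trick exploits.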


Example 12.9 (Gaussian kernels) As a more exotic example, given some compact subset 
$\mathcal{X} \subseteq \mathbb{R}^d$, consider the Gaussian kernel $K(x, z) = \exp\big(-\frac{1}{2\sigma^2} \|x - z\|_2^2\big)$. Here, unlike the linear 
kernel and polynomial kernels, it is not immediately obvious that $K$ is positive semidefinite, 
but it can be verified by building upon the PSD nature of the linear and polynomial kernels 
(see Exercise 12.19). The Gaussian kernel is a very popular choice in practice, and we return 
to study it further in the sequel. ♣ 
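Although the proof is deferred to an exercise, positive semidefiniteness of the Gaussian kernel can at least be probed numerically. The sketch below forms a Gram matrix on a few random points and evaluates the quadratic form for many random weight vectors. The bandwidth `sigma = 1.0`, the sample sizes and the helper names are assumptions made for illustration; passing the check is evidence on one instance, not a verification of the PSD property.

```python
import math
import random

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||_2^2 / (2 sigma^2))."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def quadratic_form(K, alpha):
    """Evaluate alpha^T K alpha for a square matrix K given as nested lists."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))

rng = random.Random(0)
d, n = 2, 6
points = [[rng.uniform(0, 1) for _ in range(d)] for _ in range(n)]
K = [[gaussian_kernel(p, q) for q in points] for p in points]

# Smallest quadratic form over many random directions; for a PSD Gram
# matrix this should be non-negative up to floating-point rounding.
min_form = min(
    quadratic_form(K, [rng.gauss(0, 1) for _ in range(n)])
    for _ in range(1000)
)
print(min_form >= -1e-9)
```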


12.2.2 Feature maps in $\ell^2(\mathbb{N})$ 


The mapping $x \mapsto \Phi(x)$ defined for the polynomial kernel in equation (12.2) is often referred 
to as a feature map, since it captures the sense in which the polynomial kernel function 
embeds the original data into a higher-dimensional space. The notion of a feature mapping 
can be used to define a PSD kernel in far more generality. Indeed, any function 
$\Phi : \mathcal{X} \to \ell^2(\mathbb{N})$ can be viewed as mapping the original space $\mathcal{X}$ to some subset of the space $\ell^2(\mathbb{N})$ of all 
square-summable sequences. Our previously discussed mapping (12.2) for the polynomial 
kernel is a special case, since $\mathbb{R}^D$ is a finite-dimensional subspace of $\ell^2(\mathbb{N})$. 

Given any such feature map, we can then define a symmetric kernel via the inner product 
$K(x, z) = \langle \Phi(x), \Phi(z) \rangle_{\ell^2(\mathbb{N})}$. It is often the case, for suitably chosen feature maps, that this 
kernel has a closed-form expression in terms of the pair $(x, z)$. Consequently, we can com- 
pute inner products between the embedded data pairs $(\Phi(x), \Phi(z))$ without actually having 
to work in $\ell^2(\mathbb{N})$, or some other high-dimensional space. This fact underlies the power of 
RKHS methods, and goes under the colloquial name of the “kernel trick”. For example, in 
the context of the $m$th-degree polynomial kernel on $\mathbb{R}^d$ from Example 12.8, evaluating the 
kernel requires on the order of $d$ basic operations, whereas the embedded data lies in a space 
of dimension roughly $d^m$ (see Exercise 12.11). Of course, there are other kernels that implicitly embed 
the data in some infinite-dimensional space, with the Gaussian kernel from Example 12.9 
being one such case. 
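To make the cost comparison concrete, the sketch below evaluates the degree-$m$ homogeneous polynomial kernel in two ways: directly from the inner product, and through an explicit monomial feature map. The feature map used here is one standard choice based on the multinomial theorem, in which each degree-$m$ monomial is weighted by the square root of its multinomial coefficient; it is an assumption of this example, as are the input vectors and the function names.

```python
import itertools
import math
from collections import Counter

def poly_kernel(x, z, m):
    """Homogeneous polynomial kernel K(x, z) = <x, z>^m; about d operations."""
    return sum(a * b for a, b in zip(x, z)) ** m

def explicit_features(x, m):
    """All degree-m monomials of x, each scaled by sqrt(multinomial coefficient)."""
    feats = []
    for idx in itertools.combinations_with_replacement(range(len(x)), m):
        coef = math.factorial(m)
        for count in Counter(idx).values():
            coef //= math.factorial(count)   # m! / (k_1! k_2! ...)
        feats.append(math.sqrt(coef) * math.prod(x[i] for i in idx))
    return feats

d, m = 5, 3
x = [0.5, -1.0, 2.0, 0.25, 1.5]
z = [1.0, 0.5, -0.5, 2.0, -1.0]

phi_x, phi_z = explicit_features(x, m), explicit_features(z, m)
via_features = sum(a * b for a, b in zip(phi_x, phi_z))
via_kernel = poly_kernel(x, z, m)

# The explicit embedding has (d + m - 1 choose m) coordinates, which grows
# on the order of d^m for fixed m, while the kernel uses only d products.
num_features = len(phi_x)
print(num_features, abs(via_features - via_kernel) < 1e-9)
```

Both routes give the same number, but the explicit route materializes every monomial, which quickly becomes impractical as $d$ or $m$ grows.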


Let us consider a particular form of feature map that plays an important role in subsequent 
analysis: 


Example 12.10 (PSD kernels from basis expansions) Consider the sinusoidal Fourier basis 
functions $\phi_j(x) := \sin\big(\frac{(2j-1)\pi x}{2}\big)$ for all $j \in \mathbb{N} = \{1, 2, \ldots\}$. By construction, we have 

$$\langle \phi_j, \phi_k \rangle_{L^2[0,1]} = \int_0^1 \phi_j(x) \phi_k(x)\,dx = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise,} \end{cases}$$

so that these functions are orthonormal in $L^2[0,1]$. Now given some sequence $(\mu_j)_{j=1}^{\infty}$ of 
non-negative weights for which $\sum_{j=1}^{\infty} \mu_j < \infty$, let us define the feature map 

$$\Phi(x) := \big( \sqrt{\mu_1}\,\phi_1(x),\; \sqrt{\mu_2}\,\phi_2(x),\; \sqrt{\mu_3}\,\phi_3(x),\; \ldots \big).$$

By construction, the element $\Phi(x)$ belongs to $\ell^2(\mathbb{N})$, since 

$$\|\Phi(x)\|_{\ell^2(\mathbb{N})}^2 = \sum_{j=1}^{\infty} \mu_j \phi_j^2(x) \le \sum_{j=1}^{\infty} \mu_j < \infty.$$

Consequently, this particular choice of feature map defines a PSD kernel of the form 

$$K(x, z) := \langle \Phi(x), \Phi(z) \rangle_{\ell^2(\mathbb{N})} = \sum_{j=1}^{\infty} \mu_j \phi_j(x) \phi_j(z).$$

As our development in the sequel will clarify, a very broad class of PSD kernel functions 
can be generated in this way. ♣ 
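A truncated version of such a basis-expansion kernel is easy to compute. The sketch below is a minimal illustration under several assumptions that are not fixed by the text: the sine functions $\sin((2j-1)\pi x/2)$ are used as the basis, the weights are taken to be $\mu_j = 1/j^2$, and the infinite sum is cut off after 200 terms. It checks the bound $K(x, x) \le \sum_j \mu_j$ and the non-negativity of one quadratic form.

```python
import math
import random

NUM_TERMS = 200  # truncation level for the infinite sum (an assumption)

def phi(j, x):
    """Sine basis function; bounded by 1 in absolute value (assumed form)."""
    return math.sin((2 * j - 1) * math.pi * x / 2.0)

def mu(j):
    """Summable non-negative weights; mu_j = 1/j^2 is a hypothetical choice."""
    return 1.0 / (j * j)

def feature_kernel(x, z):
    """Truncated kernel K(x, z) = sum_j mu_j phi_j(x) phi_j(z)."""
    return sum(mu(j) * phi(j, x) * phi(j, z) for j in range(1, NUM_TERMS + 1))

rng = random.Random(0)
n = 5
points = [rng.random() for _ in range(n)]
alpha = [rng.gauss(0, 1) for _ in range(n)]

K = [[feature_kernel(p, q) for q in points] for p in points]
weight_sum = sum(mu(j) for j in range(1, NUM_TERMS + 1))

# alpha^T K alpha equals sum_j mu_j (sum_i alpha_i phi_j(x_i))^2, hence >= 0.
quad = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
print(max(K[i][i] for i in range(n)) <= weight_sum, quad >= -1e-12)
```

The non-negativity does not depend on the particular basis or weights: any feature map into a space with an inner product yields a PSD kernel by the same Gram argument as in Example 12.7.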


12.2.3 Constructing an RKHS from a kernel 


In this section, we show how any positive semidefinite kernel function $K$ defined on the 
Cartesian product space $\mathcal{X} \times \mathcal{X}$ can be used to construct a particular Hilbert space of functions 
on $\mathcal{X}$. This Hilbert space is unique, and has the following special property: for any $x \in \mathcal{X}$, 
the function $K(\cdot, x)$ belongs to $\mathbb{H}$, and satisfies the relation 

$$\langle f, K(\cdot, x) \rangle_{\mathbb{H}} = f(x) \quad \text{for all } f \in \mathbb{H}. \tag{12.3}$$

This property is known as the kernel reproducing property for the Hilbert space, and it 
underlies the power of RKHS methods in practice. More precisely, it allows us to think 
of the kernel itself as defining a feature map¹ $x \mapsto K(\cdot, x) \in \mathbb{H}$. Inner products in the 
embedded space reduce to kernel evaluations, since the reproducing property ensures that 
$\langle K(\cdot, x), K(\cdot, z) \rangle_{\mathbb{H}} = K(x, z)$ for all $x, z \in \mathcal{X}$. As mentioned earlier, this computational 
benefit of the RKHS embedding is often referred to as the kernel trick. 

How does one use a kernel to define a Hilbert space with the reproducing property (12.3)? 
Recalling the definition of a Hilbert space, we first need to form a vector space of functions, 
and then we need to endow it with an appropriate inner product. Accordingly, let us begin by 
considering the set $\widetilde{\mathbb{H}}$ of functions of the form $f(\cdot) = \sum_{j=1}^{n} \alpha_j K(\cdot, x_j)$ for some integer $n \ge 1$, 
set of points $\{x_j\}_{j=1}^{n} \subset \mathcal{X}$ and weight vector $\alpha \in \mathbb{R}^n$. It is easy to see that the set $\widetilde{\mathbb{H}}$ forms a 
vector space under the usual definitions of function addition and scalar multiplication. 

Given any pair of functions $f, \bar{f}$ in our vector space, say of the 
form $f(\cdot) = \sum_{j=1}^{n} \alpha_j K(\cdot, x_j)$ and $\bar{f}(\cdot) = \sum_{k=1}^{\bar{n}} \bar{\alpha}_k K(\cdot, \bar{x}_k)$, we propose to define their inner 
product as 

$$\langle f, \bar{f} \rangle_{\widetilde{\mathbb{H}}} := \sum_{j=1}^{n} \sum_{k=1}^{\bar{n}} \alpha_j \bar{\alpha}_k K(x_j, \bar{x}_k). \tag{12.4}$$

It can be verified that this definition is independent of the particular representation of the 
functions $f$ and $\bar{f}$. Moreover, this proposed inner product does satisfy the kernel reproducing 
property (12.3), since by construction, we have 

$$\langle f, K(\cdot, x) \rangle_{\widetilde{\mathbb{H}}} = \sum_{j=1}^{n} \alpha_j K(x_j, x) = f(x).$$
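The construction just described is directly computable for finite kernel expansions. The following sketch represents a function $f(\cdot) = \sum_j \alpha_j K(\cdot, x_j)$ by its centers and weights, implements the inner product (12.4), and checks the reproducing property on one example. The class name `KernelExpansion`, the one-dimensional Gaussian kernel, and the particular centers and weights are all hypothetical choices made for illustration.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """One-dimensional Gaussian kernel, used here as an example PSD kernel."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

class KernelExpansion:
    """Finite expansion f(.) = sum_j weights[j] * K(., centers[j])."""

    def __init__(self, kernel, centers, weights):
        self.kernel = kernel
        self.centers = list(centers)
        self.weights = list(weights)

    def __call__(self, x):
        """Evaluate f at the point x."""
        return sum(a * self.kernel(x, c)
                   for a, c in zip(self.weights, self.centers))

    def inner(self, other):
        """Inner product (12.4): double sum of a_j * b_k * K(x_j, xbar_k)."""
        return sum(a * b * self.kernel(c, cb)
                   for a, c in zip(self.weights, self.centers)
                   for b, cb in zip(other.weights, other.centers))

f = KernelExpansion(gaussian_kernel, centers=[0.0, 0.5, 2.0],
                    weights=[1.0, -2.0, 0.5])
x0 = 0.3
section = KernelExpansion(gaussian_kernel, centers=[x0], weights=[1.0])  # K(., x0)

# Reproducing property: <f, K(., x0)> should equal f(x0).
reproduced = f.inner(section)
print(abs(reproduced - f(x0)) < 1e-12, f.inner(f) >= 0.0)
```

Taking both arguments to be single kernel sections recovers the identity $\langle K(\cdot, x), K(\cdot, z) \rangle = K(x, z)$, which is the kernel trick in this representation.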
Of course, we still need to verify that the definition (12.4) defines a valid inner prod- 
uct. Clearly, it satisfies the symmetry (12.1a) and linearity requirements (12.1c) of an inner 
product. However, we need to verify the condition (12.1b), namely that $\langle f, f \rangle_{\widetilde{\mathbb{H}}} \ge 0$ with 
equality if and only if $f = 0$. After this step, we will have a valid inner product space, and 
the final step is to take closures of it (in a suitable sense) in order to obtain a Hilbert space. 

¹ This view, with the kernel itself defining an embedding from $\mathcal{X}$ to $\mathbb{H}$, is related to but slightly different from 
our earlier perspective, in which the feature map $\Phi$ was a mapping from $\mathcal{X}$ to $\ell^2(\mathbb{N})$. Mercer’s theorem allows 
us to connect these two points of view; see equation (12.14) and the surrounding discussion. 

With this intuition in place, we now provide a formal statement, and then prove it: 


Theorem 12.11 Given any positive semidefinite kernel function K, there is a unique 
Hilbert space H in which the kernel satisfies the reproducing property (12.3). It is 
known as the reproducing kernel Hilbert space associated with K. 


Proof As outlined above, there are three remaining steps in the proof, and we divide our 
argument accordingly. 


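Verifying condition (12.1b): The positive semidefiniteness of the kernel function $K$ implies 
that $\|f\|_{\widetilde{\mathbb{H}}}^2 = \langle f, f \rangle_{\widetilde{\mathbb{H}}} \ge 0$ for all $f$, so we need only show that $\|f\|_{\widetilde{\mathbb{H}}}^2 = 0$ if and only if $f = 0$. 
Consider a function of the form $f(\cdot) = \sum_{i=1}^{n} \alpha_i K(\cdot, x_i)$, and suppose that 

$$\langle f, f \rangle_{\widetilde{\mathbb{H}}} = \sum_{i,j=1}^{n} \alpha_i \alpha_j K(x_j, x_i) = 0.$$

We must then show that $f = 0$, or equivalently that $f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) = 0$ for all $x \in \mathcal{X}$. 
Let $(a, x) \in \mathbb{R} \times \mathcal{X}$ be arbitrary, and note that by the positive semidefiniteness of $K$, we have 

$$0 \le \Big\| a K(\cdot, x) + \sum_{i=1}^{n} \alpha_i K(\cdot, x_i) \Big\|_{\widetilde{\mathbb{H}}}^2 = a^2 K(x, x) + 2a \sum_{i=1}^{n} \alpha_i K(x, x_i),$$

where the term $\|f\|_{\widetilde{\mathbb{H}}}^2$ has been dropped from the expansion because it vanishes by assumption. 
Since $K(x, x) \ge 0$ and the scalar $a \in \mathbb{R}$ is arbitrary, this inequality can hold only if 
$\sum_{i=1}^{n} \alpha_i K(x, x_i) = 0$. Thus, we have shown that the pair $(\widetilde{\mathbb{H}}, \langle \cdot, \cdot \rangle_{\widetilde{\mathbb{H}}})$ is an inner product space. 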


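Completing the space: It remains to extend the inner product space $\widetilde{\mathbb{H}}$ to a complete inner 
product space (that is, a Hilbert space) with the given reproducing kernel. If $(f_n)_{n=1}^{\infty}$ is a Cauchy sequence in $\widetilde{\mathbb{H}}$, 
then for each $x \in \mathcal{X}$, the sequence $(f_n(x))_{n=1}^{\infty}$ is Cauchy in $\mathbb{R}$, and so must converge to some 
real number. We can thus define the pointwise limit function $f(x) := \lim_{n \to \infty} f_n(x)$, and we 
let $\mathbb{H}$ be the completion of $\widetilde{\mathbb{H}}$ by these objects. We define the norm of the limit function $f$ as 
$\|f\|_{\mathbb{H}} := \lim_{n \to \infty} \|f_n\|_{\widetilde{\mathbb{H}}}$. 

In order to verify that this definition is sensible, we need to show that for any Cauchy se- 
quence $(g_n)_{n=1}^{\infty}$ in $\widetilde{\mathbb{H}}$ such that $\lim_{n \to \infty} g_n(x) = 0$ for all $x \in \mathcal{X}$, we also have $\lim_{n \to \infty} \|g_n\|_{\widetilde{\mathbb{H}}} = 0$. 
Taking subsequences as necessary, suppose that $\lim_{n \to \infty} \|g_n\|_{\widetilde{\mathbb{H}}}^2 = 2\epsilon > 0$, so that for $n, m$ suf- 
ficiently large, we have $\|g_n\|_{\widetilde{\mathbb{H}}}^2 \ge \epsilon$ and $\|g_m\|_{\widetilde{\mathbb{H}}}^2 \ge \epsilon$. Since the sequence $(g_n)_{n=1}^{\infty}$ is Cauchy, 
we also have $\|g_n - g_m\|_{\widetilde{\mathbb{H}}}^2 < \epsilon/2$ for $n, m$ sufficiently large. Now since $g_m \in \widetilde{\mathbb{H}}$, we can write 
$g_m(\cdot) = \sum_{i=1}^{N_m} \alpha_i K(\cdot, x_i)$, for some finite positive integer $N_m$ and vector $\alpha \in \mathbb{R}^{N_m}$. By the 
reproducing property, we have 

$$\langle g_m, g_n \rangle_{\widetilde{\mathbb{H}}} = \sum_{i=1}^{N_m} \alpha_i g_n(x_i) \to 0 \quad \text{as } n \to +\infty,$$

since $g_n(x) \to 0$ for each fixed $x$. Hence, for $n$ sufficiently large, we can ensure that 
$|\langle g_m, g_n \rangle_{\widetilde{\mathbb{H}}}| \le \epsilon/2$. Putting together the pieces, we have 

$$\|g_n - g_m\|_{\widetilde{\mathbb{H}}}^2 = \|g_n\|_{\widetilde{\mathbb{H}}}^2 + \|g_m\|_{\widetilde{\mathbb{H}}}^2 - 2 \langle g_n, g_m \rangle_{\widetilde{\mathbb{H}}} \ge \epsilon + \epsilon - \epsilon = \epsilon.$$

But this lower bound contradicts the fact that $\|g_n - g_m\|_{\widetilde{\mathbb{H}}}^2 < \epsilon/2$. 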
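Thus, the norm that we have defined is sensible, and it can be used to define an inner 
product on $\mathbb{H}$ via the polarization identity 

$$\langle f, g \rangle_{\mathbb{H}} := \tfrac{1}{2} \big\{ \|f + g\|_{\mathbb{H}}^2 - \|f\|_{\mathbb{H}}^2 - \|g\|_{\mathbb{H}}^2 \big\}.$$

With this definition, it can be shown that $\langle K(\cdot, x), f \rangle_{\mathbb{H}} = f(x)$ for all $f \in \mathbb{H}$, so that $K(\cdot, x)$ 
is again reproducing over $\mathbb{H}$. 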


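Uniqueness: Finally, let us establish uniqueness. Suppose that $\mathbb{G}$ is some other Hilbert 
space with $K$ as its reproducing kernel, so that $K(\cdot, x) \in \mathbb{G}$ for all $x \in \mathcal{X}$. Since $\mathbb{G}$ is 
complete and closed under linear operations, we must have $\mathbb{H} \subseteq \mathbb{G}$. Consequently, $\mathbb{H}$ is a 
closed linear subspace of $\mathbb{G}$, so that we can write $\mathbb{G} = \mathbb{H} \oplus \mathbb{H}^{\perp}$. Let $g \in \mathbb{H}^{\perp}$ be arbitrary, and 
note that $K(\cdot, x) \in \mathbb{H}$. By orthogonality, we must have $0 = \langle K(\cdot, x), g \rangle_{\mathbb{G}} = g(x)$, from which 
we conclude that $\mathbb{H}^{\perp} = \{0\}$, and hence that $\mathbb{H} = \mathbb{G}$ as claimed. 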


12.2.4 A more abstract viewpoint and further examples 


Thus far, we have seen how any positive semidefinite kernel function can be used to build 
a Hilbert space in which the kernel satisfies the reproducing property (12.3). In the context 
of the Riesz representation theorem (Theorem 12.5), the reproducing property is equivalent 
to asserting that the function K(-, x) acts as the representer for the evaluation functional 
at $x$, namely the linear functional $L_x : \mathbb{H} \to \mathbb{R}$ that performs the operation $f \mapsto f(x)$. 
Thus, it shows that in any reproducing kernel Hilbert space, the evaluation functionals are 
all bounded. This perspective leads to the natural question: How large is the class of Hilbert 
spaces for which the evaluation functional is bounded? It turns out that this class is exactly 
equivalent to the class of reproducing kernel Hilbert spaces defined in the proof of Theo- 
rem 12.11. Indeed, an alternative way in which to define an RKHS is as follows: 


Definition 12.12 A reproducing kernel Hilbert space $\mathbb{H}$ is a Hilbert space of real- 
valued functions on $\mathcal{X}$ such that for each $x \in \mathcal{X}$, the evaluation functional $L_x : \mathbb{H} \to \mathbb{R}$ 
is bounded (i.e., there exists some $M < \infty$ such that $|L_x(f)| \le M \|f\|_{\mathbb{H}}$ for all $f \in \mathbb{H}$). 


Theorem 12.11 shows that any PSD kernel can be used to define a reproducing kernel 
Hilbert space in the sense of Definition 12.12. In order to complete the equivalence, we need 
to show that all Hilbert spaces specified by Definition 12.12 can be equipped with a repro- 
ducing kernel function. Let us state this claim formally, and then prove it: 


Theorem 12.13 Given any Hilbert space H in which the evaluation functionals are all 
bounded, there is a unique PSD kernel K that satisfies the reproducing property (12.3). 


Proof When $L_x$ is a bounded linear functional, the Riesz representation (Theorem 12.5) 
implies that there must exist some element $R_x$ of the Hilbert space $\mathbb{H}$ such that 

$$f(x) = L_x(f) = \langle f, R_x \rangle_{\mathbb{H}} \quad \text{for all } f \in \mathbb{H}. \tag{12.5}$$

Using these representers of evaluation, let us define a real-valued function $K$ on the Carte- 
sian product space $\mathcal{X} \times \mathcal{X}$ via $K(x, z) := \langle R_x, R_z \rangle_{\mathbb{H}}$. Symmetry of the inner product ensures 
that $K$ is a symmetric function, so that it remains to show that $K$ is positive semidefinite. 
For any $n \ge 1$, let $\{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ be an arbitrary collection of points, and consider the $n \times n$ 
matrix $K$ with elements $K_{ij} = K(x_i, x_j)$. For an arbitrary vector $\alpha \in \mathbb{R}^n$, we have 

$$\alpha^{\mathsf T} K \alpha = \sum_{j,k=1}^{n} \alpha_j \alpha_k K(x_j, x_k) = \Big\langle \sum_{j=1}^{n} \alpha_j R_{x_j},\; \sum_{j=1}^{n} \alpha_j R_{x_j} \Big\rangle_{\mathbb{H}} = \Big\| \sum_{j=1}^{n} \alpha_j R_{x_j} \Big\|_{\mathbb{H}}^2 \ge 0,$$

which proves the positive semidefiniteness. 

It remains to verify the reproducing property (12.3). It actually follows easily, since for 
any $x \in \mathcal{X}$, the function $K(\cdot, x)$ is equivalent to $R_x(\cdot)$. In order to see this equivalence, note 
that for any $y \in \mathcal{X}$, we have 

$$K(y, x) \overset{(i)}{=} \langle R_y, R_x \rangle_{\mathbb{H}} \overset{(ii)}{=} R_x(y),$$

where step (i) follows from our original definition of the kernel function, and step (ii) follows 
since $R_y$ is the representer of evaluation at $y$. It thus follows that our kernel satisfies the re- 
quired reproducing property (12.3). Finally, in Exercise 12.4, we argue that the reproducing 
kernel of an RKHS must be unique. 


Let us consider some more examples to illustrate our different viewpoints on RKHSs. 


Example 12.14 (Linear functions on $\mathbb{R}^d$) In Example 12.7, we showed that the linear kernel 
$K(x, z) = \langle x, z \rangle$ is positive semidefinite on $\mathbb{R}^d$. The constructive proof of Theorem 12.11 
dictates that the associated RKHS is generated by functions of the form 

$$z \mapsto \sum_{i=1}^{n} \alpha_i \langle z, x_i \rangle = \Big\langle z,\; \sum_{i=1}^{n} \alpha_i x_i \Big\rangle.$$

Each such function is linear, and therefore the associated RKHS is the class of all linear 
functions, that is, functions of the form $f_\beta(\cdot) = \langle \cdot, \beta \rangle$ for some vector $\beta \in \mathbb{R}^d$. The in- 
duced inner product is given by $\langle f_\beta, f_{\widetilde{\beta}} \rangle_{\mathbb{H}} := \langle \beta, \widetilde{\beta} \rangle$. Note that for each $z \in \mathbb{R}^d$, the function 
$K(\cdot, z) = \langle \cdot, z \rangle \equiv f_z$ is linear. Moreover, for any linear function $f_\beta$, we have 

$$\langle f_\beta, K(\cdot, z) \rangle_{\mathbb{H}} = \langle \beta, z \rangle = f_\beta(z),$$

which provides an explicit verification of the reproducing property (12.3). ♣ 


Definition 12.12 and the associated Theorem 12.13 provide us with one avenue for verify- 
ing that a given Hilbert space is not an RKHS, and so cannot be equipped with a PSD kernel. 
In particular, the boundedness of the evaluation functionals $R_x$ in an RKHS has a very im- 
portant consequence: it ensures that convergence of a sequence of functions in 
an RKHS implies pointwise convergence. Indeed, if $f_n \to f^*$ in the Hilbert space norm, then 
for any $x \in \mathcal{X}$, we have 

$$|f_n(x) - f^*(x)| = |\langle R_x, f_n - f^* \rangle_{\mathbb{H}}| \le \|R_x\|_{\mathbb{H}} \, \|f_n - f^*\|_{\mathbb{H}} \to 0, \tag{12.6}$$

where we have applied the Cauchy–Schwarz inequality. This property is not shared by an 
arbitrary Hilbert space, with the Hilbert space $L^2[0,1]$ from Example 12.4 being one case 
where this property fails. 


Example 12.15 (The space $L^2[0, 1]$ is not an RKHS) From the argument above, it suffices
to provide a sequence of functions $(f_n)_{n=1}^\infty$ that converge to the all-zero function in
$L^2[0, 1]$, but do not converge to zero in a pointwise sense. Consider the sequence of functions
$f_n(x) = x^n$ for $n = 1, 2, \ldots$. Since $\int_0^1 f_n^2(x)\,dx = \frac{1}{2n+1}$, this sequence is contained in
$L^2[0, 1]$, and moreover $\|f_n\|_{L^2[0,1]} \to 0$. However, $f_n(1) = 1$ for all $n = 1, 2, \ldots$, so that this
norm convergence does not imply pointwise convergence. Thus, if $L^2[0, 1]$ were an RKHS,
then this would contradict inequality (12.6).
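The gap between norm convergence and pointwise convergence can be checked numerically. The following is a minimal sketch in Python with NumPy; the helper name `l2_norm_on_unit_interval` and the midpoint quadrature are illustrative choices, not part of the text.

```python
import numpy as np

def l2_norm_on_unit_interval(f, m=200_000):
    """Approximate the L^2[0, 1] norm of f with a midpoint rule on m cells."""
    z = (np.arange(m) + 0.5) / m          # midpoints of the m cells
    return np.sqrt(np.mean(f(z) ** 2))    # each cell has width 1/m

ns = [1, 10, 100, 1000]
numeric_norms = [l2_norm_on_unit_interval(lambda x, n=n: x ** n) for n in ns]
exact_norms = [1.0 / np.sqrt(2 * n + 1) for n in ns]   # closed form from the example
values_at_one = [1.0 ** n for n in ns]                 # f_n(1) = 1 for every n

for n, num, ex in zip(ns, numeric_norms, exact_norms):
    print(f"n={n:5d}  numeric norm={num:.5f}  1/sqrt(2n+1)={ex:.5f}  f_n(1)=1")
```

The norms shrink toward zero as $n$ grows, while the value at $x = 1$ stays fixed at one, which is the behavior that inequality (12.6) rules out in an RKHS.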

An alternative way to see that $L^2[0, 1]$ is not an RKHS is to ask whether it is possible to
find a family of functions $\{R_x \in L^2[0, 1], \; x \in [0, 1]\}$ such that

\[ \int_0^1 f(y) R_x(y)\,dy = f(x) \quad \text{for all } f \in L^2[0, 1]. \]

This identity will hold if we define $R_x$ to be a “delta-function”, that is, infinite at $x$ and zero
elsewhere. However, such objects certainly do not belong to $L^2[0, 1]$, and exist only in the
sense of generalized functions. ♣


Although $L^2[0, 1]$ itself is too large to be a reproducing kernel Hilbert space, we can obtain
an RKHS by imposing further restrictions on our functions. One way to do so is by imposing
constraints on functions and their derivatives. The Sobolev spaces form an important class
that arises in this way: the following example describes a first-order Sobolev space that is an
RKHS.


Example 12.16 (A simple Sobolev space) A function $f$ over $[0, 1]$ is said to be absolutely
continuous (or abs. cts. for short) if its derivative $f'$ exists almost everywhere and is
Lebesgue-integrable, and we have $f(x) = f(0) + \int_0^x f'(z)\,dz$ for all $x \in [0, 1]$. Now consider
the set of functions

\[ \mathbb{H}^1[0, 1] := \{ f: [0, 1] \to \mathbb{R} \mid f(0) = 0, \text{ and } f \text{ is abs. cts. with } f' \in L^2[0, 1] \}. \tag{12.7} \]

Let us define an inner product on this space via $\langle f, g \rangle_{\mathbb{H}^1} := \int_0^1 f'(z) g'(z)\,dz$; we claim that
the resulting Hilbert space is an RKHS.

One way to verify this claim is by exhibiting a representer of evaluation: for any $x \in [0, 1]$,
consider the function $R_x(z) = \min\{x, z\}$. It is differentiable at every point $z \in [0, 1] \setminus \{x\}$, and
we have $R_x'(z) = \mathbb{I}_{[0,x]}(z)$, corresponding to the binary-valued indicator function for
membership in the interval $[0, x]$. Moreover, for any $z \in [0, 1]$, it is easy to verify that

\[ \min\{x, z\} = \int_0^z \mathbb{I}_{[0,x]}(u)\,du, \tag{12.8} \]

so that $R_x$ is absolutely continuous by definition. Since $R_x(0) = 0$, we conclude that $R_x$ is an
element of $\mathbb{H}^1[0, 1]$. Finally, to verify that $R_x$ is the representer of evaluation, we calculate

\[ \langle f, R_x \rangle_{\mathbb{H}^1} = \int_0^1 f'(z) R_x'(z)\,dz = \int_0^x f'(z)\,dz = f(x), \]

where the final equality uses the fundamental theorem of calculus together with $f(0) = 0$.

As shown in the proof of Theorem 12.13, the function $K(\cdot, x)$ is equivalent to the
representer $R_x(\cdot)$. Thus, the kernel associated with the first-order Sobolev space on $[0, 1]$ is
given by $K(x, z) = R_x(z) = \min\{x, z\}$. To confirm that it is positive semidefinite, note that
equation (12.8) implies that

\[ K(x, z) = \int_0^1 \mathbb{I}_{[0,x]}(u)\, \mathbb{I}_{[0,z]}(u)\,du = \langle \mathbb{I}_{[0,x]}, \mathbb{I}_{[0,z]} \rangle_{L^2[0,1]}, \]

thereby providing a Gram representation of the kernel that certifies its PSD nature. We
conclude that $K(x, z) = \min\{x, z\}$ is the unique positive semidefinite kernel function associated
with this first-order Sobolev space. ♣
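As a numerical sanity check of the Gram representation, the sketch below builds the kernel matrix of $\min\{x, z\}$ at randomly chosen points and compares it with inner products of discretized indicator functions. The sample size, random seed, and grid resolution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=50))   # arbitrary design points in [0, 1]

# Gram matrix of the first-order Sobolev kernel K(x, z) = min{x, z}.
gram = np.minimum(x[:, None], x[None, :])

# Gram representation: K(x, z) = <1_[0,x], 1_[0,z]> in L^2[0, 1], approximated
# on a fine midpoint grid u_1, ..., u_m with cell width 1/m.
m = 20_000
u = (np.arange(m) + 0.5) / m
indicators = (u[None, :] <= x[:, None]).astype(float)   # row i is 1_[0, x_i]
gram_from_indicators = indicators @ indicators.T / m

min_eigenvalue = np.linalg.eigvalsh(gram).min()
max_gap = np.abs(gram - gram_from_indicators).max()
print(f"smallest eigenvalue: {min_eigenvalue:.3e}, max entrywise gap: {max_gap:.3e}")
```

The smallest eigenvalue is non-negative up to round-off, and the two matrices agree up to the grid resolution $1/m$.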


Let us now turn to some higher-order generalizations of the first-order Sobolev space from 
Example 12.16. 


Example 12.17 (Higher-order Sobolev spaces and smoothing splines) For some fixed
integer $\alpha \ge 1$, consider the class $\mathbb{H}^\alpha[0, 1]$ of real-valued functions on $[0, 1]$ that are $\alpha$-times
differentiable (almost everywhere), with the $\alpha$-derivative $f^{(\alpha)}$ being Lebesgue-integrable,
and such that $f(0) = f^{(1)}(0) = \cdots = f^{(\alpha-1)}(0) = 0$. (Here $f^{(k)}$ denotes the $k$th-order derivative
of $f$.) We may define an inner product on this space via

\[ \langle f, g \rangle_{\mathbb{H}} := \int_0^1 f^{(\alpha)}(z)\, g^{(\alpha)}(z)\,dz. \tag{12.9} \]

Note that this set-up generalizes Example 12.16, which corresponds to the case $\alpha = 1$.
We now claim that this inner product defines an RKHS, and more specifically, that the
kernel is given by

\[ K(x, y) = \int_0^1 \frac{(x - z)_+^{\alpha-1}}{(\alpha-1)!} \, \frac{(y - z)_+^{\alpha-1}}{(\alpha-1)!}\,dz, \]

where $(t)_+ := \max\{0, t\}$. Note that the function $R_x(\cdot) := K(\cdot, x)$ is $\alpha$-times differentiable
almost everywhere on $[0, 1]$ with $R_x^{(\alpha)}(y) = (x - y)_+^{\alpha-1}/(\alpha - 1)!$. To verify that $R_x$ acts as the
representer of evaluation, recall that any function $f: [0, 1] \to \mathbb{R}$ that is $\alpha$-times differentiable
almost everywhere has the Taylor-series expansion

\[ f(x) = \sum_{\ell=0}^{\alpha-1} f^{(\ell)}(0) \frac{x^\ell}{\ell!} + \int_0^1 f^{(\alpha)}(z) \frac{(x - z)_+^{\alpha-1}}{(\alpha-1)!}\,dz. \tag{12.10} \]

Using the previously mentioned properties of $R_x$ and the definition (12.9) of the inner
product, we obtain

\[ \langle R_x, f \rangle_{\mathbb{H}} = \int_0^1 f^{(\alpha)}(z) \frac{(x - z)_+^{\alpha-1}}{(\alpha-1)!}\,dz = f(x), \]

where the final equality uses the Taylor-series expansion (12.10), and the fact that the first
$(\alpha - 1)$ derivatives of $f$ vanish at 0.

In Example 12.29 to follow, we show how to augment the Hilbert space so as to remove
the constraint on the first $(\alpha - 1)$ derivatives of the functions $f$. ♣


12.3 Mercer’s theorem and its consequences 


We now turn to a useful representation of a broad class of positive semidefinite kernel func- 
tions, namely in terms of their eigenfunctions. Recall from classical linear algebra that any 
positive semidefinite matrix has an orthonormal basis of eigenvectors, and the associated 
eigenvalues are non-negative. The abstract version of Mercer’s theorem generalizes this de- 
composition to positive semidefinite kernel functions. 

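Let $\mathbb{P}$ be a non-negative measure over a compact metric space $\mathcal{X}$, and consider the function
class $L^2(\mathcal{X}; \mathbb{P})$ with the usual squared norm

\[ \|f\|^2_{L^2(\mathcal{X}; \mathbb{P})} := \int_{\mathcal{X}} f^2(x)\, d\mathbb{P}(x). \]

Since the measure $\mathbb{P}$ remains fixed throughout, we frequently adopt the shorthand notation
$L^2(\mathcal{X})$ or even just $L^2$ for this norm. Given a symmetric PSD kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
that is continuous, we can define a linear operator $T_K$ on $L^2(\mathcal{X})$ via

\[ T_K(f)(x) := \int_{\mathcal{X}} K(x, z) f(z)\, d\mathbb{P}(z). \tag{12.11a} \]

We assume that the kernel function satisfies the inequality

\[ \int_{\mathcal{X} \times \mathcal{X}} K^2(x, z)\, d\mathbb{P}(x)\, d\mathbb{P}(z) < \infty, \tag{12.11b} \]

which ensures that $T_K$ is a bounded linear operator on $L^2(\mathcal{X})$. Indeed, we have

\[ \|T_K(f)\|^2_{L^2(\mathcal{X})} = \int_{\mathcal{X}} \Big( \int_{\mathcal{X}} K(x, y) f(x)\, d\mathbb{P}(x) \Big)^2 d\mathbb{P}(y)
   \le \|f\|^2_{L^2(\mathcal{X})} \int_{\mathcal{X} \times \mathcal{X}} K^2(x, y)\, d\mathbb{P}(x)\, d\mathbb{P}(y), \]

where we have applied the Cauchy–Schwarz inequality. Operators of this type are known as
Hilbert–Schmidt operators.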
Let us illustrate these definitions with some examples.


Example 12.18 (PSD matrices) Let $\mathcal{X} = [d] := \{1, 2, \ldots, d\}$ be equipped with the Hamming
metric, and let $\mathbb{P}(\{j\}) = 1$ for all $j \in \{1, 2, \ldots, d\}$ be the counting measure on this
discrete space. In this case, any function $f: \mathcal{X} \to \mathbb{R}$ can be identified with the $d$-dimensional
vector $(f(1), \ldots, f(d))$, and a symmetric kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be identified
with the symmetric $d \times d$ matrix $K$ with entries $K_{ij} = K(i, j)$. Consequently, the integral
operator (12.11a) reduces to ordinary matrix–vector multiplication

\[ T_K(f)(x) = \int_{\mathcal{X}} K(x, z) f(z)\, d\mathbb{P}(z) = \sum_{z=1}^d K(x, z) f(z). \]

By standard linear algebra, we know that the matrix $K$ has an orthonormal collection of
eigenvectors in $\mathbb{R}^d$, say $\{v_1, \ldots, v_d\}$, along with a set of non-negative eigenvalues
$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_d$, such that

\[ K = \sum_{j=1}^d \mu_j v_j v_j^{\mathrm{T}}. \tag{12.12} \]

Mercer’s theorem, to be stated shortly, provides a substantial generalization of this
decomposition to a general positive semidefinite kernel function. ♣


Example 12.19 (First-order Sobolev kernel) Now suppose that $\mathcal{X} = [0, 1]$, and that $\mathbb{P}$ is
the Lebesgue measure. Recalling the kernel function $K(x, z) = \min\{x, z\}$, we have

\[ T_K(f)(x) = \int_0^1 \min\{x, z\} f(z)\,dz = \int_0^x z f(z)\,dz + \int_x^1 x f(z)\,dz. \]

We return to analyze this particular integral operator in Example 12.23. ♣


Having gained some intuition for the general notion of a kernel integral operator, we are 
now ready for the statement of the abstract Mercer’s theorem. 


Theorem 12.20 (Mercer’s theorem) Suppose that $\mathcal{X}$ is compact, the kernel function
$K$ is continuous and positive semidefinite, and satisfies the Hilbert–Schmidt
condition (12.11b). Then there exist a sequence of eigenfunctions $(\phi_j)_{j=1}^\infty$ that form an
orthonormal basis of $L^2(\mathcal{X}; \mathbb{P})$, and non-negative eigenvalues $(\mu_j)_{j=1}^\infty$ such that

\[ T_K(\phi_j) = \mu_j \phi_j \quad \text{for } j = 1, 2, \ldots. \tag{12.13a} \]

Moreover, the kernel function has the expansion

\[ K(x, z) = \sum_{j=1}^\infty \mu_j \phi_j(x) \phi_j(z), \tag{12.13b} \]

where the convergence of the infinite series holds absolutely and uniformly.

Remarks: The original theorem proved by Mercer applied only to operators defined on
$L^2([a, b])$ for some finite $a < b$. The more abstract version stated here follows as a
consequence of more general results on the eigenvalues of compact operators on Hilbert spaces;
we refer the reader to the bibliography section for references.

Among other consequences, Mercer’s theorem provides intuition on how reproducing
kernel Hilbert spaces can be viewed as providing a particular embedding of the function
domain $\mathcal{X}$ into a subset of the sequence space $\ell^2(\mathbb{N})$. In particular, given the eigenfunctions
and eigenvalues guaranteed by Mercer’s theorem, we may define a mapping $\Phi: \mathcal{X} \to \ell^2(\mathbb{N})$
via

\[ x \mapsto \Phi(x) := \big( \sqrt{\mu_1}\,\phi_1(x), \; \sqrt{\mu_2}\,\phi_2(x), \; \sqrt{\mu_3}\,\phi_3(x), \; \ldots \big). \tag{12.14} \]

By construction, we have

\[ \|\Phi(x)\|^2_{\ell^2(\mathbb{N})} = \sum_{j=1}^\infty \mu_j \phi_j^2(x) = K(x, x) < \infty, \]

showing that the map $x \mapsto \Phi(x)$ is a type of (weighted) feature map that embeds the original
vector into a subset of $\ell^2(\mathbb{N})$. Moreover, this feature map also provides an explicit inner
product representation of the kernel over $\ell^2(\mathbb{N})$, namely

\[ \langle \Phi(x), \Phi(z) \rangle_{\ell^2(\mathbb{N})} = \sum_{j=1}^\infty \mu_j \phi_j(x) \phi_j(z) = K(x, z). \]
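In the finite setting of Example 12.18, the feature map can be written down explicitly, which gives a quick way to check the two identities above. The sketch below is illustrative only: the matrix `K` is a randomly generated PSD matrix, and the names `Phi`, `inner_products` and `squared_norms` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
K = A @ A.T                         # a symmetric PSD "kernel" on X = {1, ..., 6}

mu, V = np.linalg.eigh(K)           # eigenvalues (ascending), orthonormal eigenvectors
mu = np.clip(mu, 0.0, None)         # remove tiny negative values caused by round-off

# Row x of Phi is the feature vector (sqrt(mu_1) phi_1(x), ..., sqrt(mu_d) phi_d(x)),
# where phi_j is identified with the j-th eigenvector (column j of V).
Phi = V * np.sqrt(mu)

inner_products = Phi @ Phi.T                 # <Phi(x), Phi(z)> for all pairs (x, z)
squared_norms = np.sum(Phi ** 2, axis=1)     # ||Phi(x)||^2 for each x
print(np.abs(inner_products - K).max())
```

Under the counting measure, orthonormality in $L^2(\mathcal{X}; \mathbb{P})$ coincides with Euclidean orthonormality, so the eigenvectors returned by `numpy.linalg.eigh` can play the role of the eigenfunctions directly.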


Let us illustrate Mercer’s theorem by considering some examples: 


Example 12.21 (Eigenfunctions for a symmetric PSD matrix) As discussed in
Example 12.18, a symmetric PSD $d$-dimensional matrix can be viewed as a kernel function on
the space $[d] \times [d]$, where we adopt the shorthand $[d] := \{1, 2, \ldots, d\}$. In this case, the
eigenfunction $\phi_j: [d] \to \mathbb{R}$ can be identified with the vector $v_j := (\phi_j(1), \ldots, \phi_j(d)) \in \mathbb{R}^d$. Thus,
in this special case, the eigenvalue equation $T_K(\phi_j) = \mu_j \phi_j$ is equivalent to asserting that
$v_j \in \mathbb{R}^d$ is an eigenvector of the kernel matrix. Consequently, the decomposition (12.13b)
then reduces to the familiar statement that any symmetric PSD matrix has an orthonormal
basis of eigenfunctions, with associated non-negative eigenvalues, as previously stated in
equation (12.12). ♣


Example 12.22 (Eigenfunctions of a polynomial kernel) Let us compute the eigenfunctions
of the second-order polynomial kernel $K(x, z) = (1 + xz)^2$ defined over the Cartesian
product $[-1, 1] \times [-1, 1]$, where the interval $[-1, 1]$ is equipped with the Lebesgue measure.
Given a function $f: [-1, 1] \to \mathbb{R}$, we have

\[ \int_{-1}^1 K(x, z) f(z)\,dz = \int_{-1}^1 (1 + 2xz + x^2 z^2) f(z)\,dz
   = \Big\{ \int_{-1}^1 f(z)\,dz \Big\} + \Big\{ 2 \int_{-1}^1 z f(z)\,dz \Big\} x + \Big\{ \int_{-1}^1 z^2 f(z)\,dz \Big\} x^2, \]

showing that any eigenfunction of the kernel integral operator must be a polynomial of
degree at most two. Consequently, the eigenfunction problem can be reduced to an ordinary
eigenvalue problem in terms of the coefficients in the expansion $f(x) = a_0 + a_1 x + a_2 x^2$.

Following some simple algebra, we find that, if $f$ is an eigenfunction with eigenvalue $\mu$,
then these coefficients must satisfy the linear system

\[ \begin{bmatrix} 2 & 0 & 2/3 \\ 0 & 4/3 & 0 \\ 2/3 & 0 & 2/5 \end{bmatrix}
   \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}
   = \mu \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}. \]

Solving this ordinary eigensystem, we find the following eigenfunction-eigenvalue pairs:

\[ \phi_1(x) = -0.9403 - 0.3404\,x^2, \quad \text{with } \mu_1 = 2.2414, \]
\[ \phi_2(x) = x, \quad \text{with } \mu_2 = 1.3333, \]
\[ \phi_3(x) = -0.3404 + 0.9403\,x^2, \quad \text{with } \mu_3 = 0.1586. \]
♣
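The eigenpairs quoted in this example can be reproduced with a few lines of NumPy. This is a minimal sketch; the variable names are illustrative, and the sign of each eigenvector returned by the solver is arbitrary.

```python
import numpy as np

# Matrix from the linear system above: the operator maps the coefficients
# (a0, a1, a2) of f(x) = a0 + a1 x + a2 x^2 to the coefficients of T_K(f).
M = np.array([
    [2.0,       0.0,       2.0 / 3.0],
    [0.0,       4.0 / 3.0, 0.0],
    [2.0 / 3.0, 0.0,       2.0 / 5.0],
])

eigenvalues, eigenvectors = np.linalg.eigh(M)    # ascending order
order = np.argsort(eigenvalues)[::-1]            # reorder to descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]            # column j holds (a0, a1, a2)

for mu, coeffs in zip(eigenvalues, eigenvectors.T):
    print(f"mu = {mu:.4f}, coefficients (a0, a1, a2) = {np.round(coeffs, 4)}")
```

Note that the coefficient vectors are normalized to unit Euclidean length, which matches the values reported in the example.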


Example 12.23 (Eigenfunctions for a first-order Sobolev space) In Example 12.16, we
introduced the first-order Sobolev space $\mathbb{H}^1[0, 1]$. In Example 12.19, we found that its kernel
function takes the form $K(x, z) = \min\{x, z\}$, and determined the form of the associated
integral operator. Using this previous development, if $\phi: [0, 1] \to \mathbb{R}$ is an eigenfunction of $T_K$
with eigenvalue $\mu \ne 0$, then it must satisfy the relation $T_K(\phi) = \mu\phi$, or equivalently

\[ \int_0^x z \phi(z)\,dz + \int_x^1 x \phi(z)\,dz = \mu \phi(x) \quad \text{for all } x \in [0, 1]. \]

Since this relation must hold for all $x \in [0, 1]$, we may take derivatives with respect to $x$.
Doing so twice yields the second-order differential equation $\mu \phi''(x) + \phi(x) = 0$. Combined with
the boundary condition $\phi(0) = 0$, we obtain $\phi(x) = \sin(x/\sqrt{\mu})$ as potential eigenfunctions.
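The two differentiation steps can be written out as follows. This is a routine calculation based on the Leibniz rule, spelled out here for convenience.

```latex
\begin{align*}
\frac{d}{dx}\Big[\int_0^x z\,\phi(z)\,dz + x\int_x^1 \phi(z)\,dz\Big]
  &= x\,\phi(x) + \int_x^1 \phi(z)\,dz - x\,\phi(x)
   = \int_x^1 \phi(z)\,dz,
  &&\text{so } \int_x^1 \phi(z)\,dz = \mu\,\phi'(x), \\
\frac{d}{dx}\Big[\int_x^1 \phi(z)\,dz\Big]
  &= -\phi(x),
  &&\text{so } -\phi(x) = \mu\,\phi''(x).
\end{align*}
```

Rearranging the second line gives $\mu\phi''(x) + \phi(x) = 0$. Setting $x = 0$ in the original integral relation gives $\mu\phi(0) = 0$, which yields the boundary condition $\phi(0) = 0$ because $\mu \ne 0$.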


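Now using the boundary condition $\int_0^1 z \phi(z)\,dz = \mu \phi(1)$, we deduce that the
eigenfunction-eigenvalue pairs are given by

\[ \phi_j(t) = \sin\frac{(2j - 1)\pi t}{2} \quad \text{and} \quad \mu_j = \Big( \frac{2}{(2j - 1)\pi} \Big)^2 \quad \text{for } j = 1, 2, \ldots. \]
♣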
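These closed-form eigenvalues can be compared against a simple discretization of the integral operator. The sketch below replaces the integral by a midpoint rule on `m` cells (a Nyström-type approximation chosen here for illustration, not a method taken from the text) and computes the spectrum of the resulting matrix.

```python
import numpy as np

m = 500
z = (np.arange(m) + 0.5) / m                      # midpoint grid on [0, 1]
# Discretized operator: (T_K f)(z_i) ~ (1/m) * sum_k min{z_i, z_k} f(z_k).
T = np.minimum(z[:, None], z[None, :]) / m

evals, evecs = np.linalg.eigh(T)
evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order

j = np.arange(1, 6)
exact = (2.0 / ((2 * j - 1) * np.pi)) ** 2        # mu_j from the example
numeric = evals[:5]

# Compare the leading eigenvector with sin(pi t / 2), up to sign and scale.
phi1 = np.sin(np.pi * z / 2)
cosine_similarity = abs(evecs[:, 0] @ phi1) / np.linalg.norm(phi1)

for jj, (num, ex) in enumerate(zip(numeric, exact), start=1):
    print(f"j={jj}: numeric {num:.6f}  exact {ex:.6f}")
```

The leading eigenvalue is close to $4/\pi^2 \approx 0.405$, and the subsequent ones decay at the rate $j^{-2}$.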


Example 12.24 (Translation-invariant kernels) An important class of kernels has a
translation-invariant form. In particular, given a function $\psi: [-1, 1] \to \mathbb{R}$ that is even (meaning
that $\psi(u) = \psi(-u)$ for all $u \in [-1, 1]$), let us extend its domain to the real line by the periodic
extension $\psi(u + 2k) = \psi(u)$ for all $u \in [-1, 1]$ and integers $k \in \mathbb{Z}$.

Using this function, we may define a translation-invariant kernel on the Cartesian product
space $[-1, 1] \times [-1, 1]$ via $K(x, z) = \psi(x - z)$. Note that the evenness of $\psi$ ensures that this
kernel is symmetric. Moreover, the kernel integral operator takes the form

\[ T_K(f)(x) = \underbrace{\int_{-1}^1 \psi(x - z) f(z)\,dz}_{(\psi * f)(x)}, \]

and thus is a convolution operator.
A classical result from analysis is that the eigenfunctions of convolution operators are
given by the Fourier basis; let us prove this fact here. We first show that the cosine functions
$\phi_j(x) = \cos(\pi j x)$ for $j = 0, 1, 2, \ldots$ are eigenfunctions of the operator $T_K$. Indeed, we have

\[ T_K(\phi_j)(x) = \int_{-1}^1 \psi(x - z) \cos(\pi j z)\,dz = \int_{-1-x}^{1-x} \psi(-u) \cos(\pi j (x + u))\,du, \]

where we have made the change of variable $u = z - x$. Note that the interval of integration
$[-1 - x, 1 - x]$ is of length 2, and since both $\psi(-u)$ and $\cos(\pi j (x + u))$ have period 2, we can
shift the interval of integration to $[-1, 1]$. Combined with the evenness of $\psi$, we conclude
that $T_K(\phi_j)(x) = \int_{-1}^1 \psi(u) \cos(\pi j (x + u))\,du$. Using the elementary trigonometric identity

\[ \cos(\pi j (x + u)) = \cos(\pi j x) \cos(\pi j u) - \sin(\pi j x) \sin(\pi j u), \]

we find that

\[ T_K(\phi_j)(x) = \Big\{ \int_{-1}^1 \psi(u) \cos(\pi j u)\,du \Big\} \cos(\pi j x) - \Big\{ \int_{-1}^1 \psi(u) \sin(\pi j u)\,du \Big\} \sin(\pi j x)
   = c_j \cos(\pi j x), \]

where $c_j = \int_{-1}^1 \psi(u) \cos(\pi j u)\,du$ is the $j$th cosine coefficient of $\psi$. In this calculation, we have
used the evenness of $\psi$ to argue that the integral with the sine function vanishes.

A similar argument shows that each of the sinusoids

\[ \tilde\phi_j(x) = \sin(j \pi x) \quad \text{for } j = 1, 2, \ldots \]

is also an eigenfunction with eigenvalue $c_j$. Since the functions $\{\phi_j, \; j = 0, 1, 2, \ldots\} \cup \{\tilde\phi_j, \;
j = 1, 2, \ldots\}$ form a complete orthogonal basis of $L^2[-1, 1]$, there are no other eigenfunctions
that are not linear combinations of these functions. Consequently, by Mercer’s theorem, the
kernel function has the eigenexpansion

\[ K(x, z) = \sum_{j=0}^\infty c_j \big\{ \cos(\pi j x) \cos(\pi j z) + \sin(\pi j x) \sin(\pi j z) \big\} = \sum_{j=0}^\infty c_j \cos(\pi j (x - z)), \]

where $c_j$ are the (cosine) Fourier coefficients of $\psi$. Thus, we see that $K$ is positive
semidefinite if and only if $c_j \ge 0$ for $j = 0, 1, 2, \ldots$. ♣
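The eigenfunction relation $T_K(\phi_j) = c_j \phi_j$ can be verified numerically for a concrete choice of $\psi$. In the sketch below, the function $\psi(u) = \exp(\cos(\pi u))$ is an assumption made purely for illustration (it is even and has period 2), and the integrals are approximated by an equally weighted sum over one period.

```python
import numpy as np

def psi(u):
    """An even function with period 2 (illustrative choice, not from the text)."""
    return np.exp(np.cos(np.pi * u))

m = 2000
z = -1.0 + 2.0 * np.arange(m) / m    # equally spaced grid covering one period [-1, 1)
w = 2.0 / m                          # quadrature weight for periodic integrands

def apply_TK(f, x):
    """Approximate T_K(f)(x) = int_{-1}^{1} psi(x - z) f(z) dz."""
    return w * np.sum(psi(x - z) * f(z))

j = 3
c_j = w * np.sum(psi(z) * np.cos(np.pi * j * z))   # j-th cosine coefficient of psi
xs = np.linspace(-1.0, 1.0, 7)
lhs = np.array([apply_TK(lambda t: np.cos(np.pi * j * t), x) for x in xs])
rhs = c_j * np.cos(np.pi * j * xs)
max_err = np.max(np.abs(lhs - rhs))
print(f"c_{j} = {c_j:.6f}, max |T_K(phi_j) - c_j phi_j| = {max_err:.2e}")
```

The equally weighted rule is used because it is highly accurate for smooth periodic integrands; any other quadrature scheme would serve the same purpose.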


Example 12.25 (Gaussian kernel) As previously introduced in Example 12.9, a popular
choice of kernel on some subset $\mathcal{X} \subseteq \mathbb{R}^d$ is the Gaussian kernel given by
$K(x, z) = \exp\big(-\frac{\|x - z\|_2^2}{2\sigma^2}\big)$, where $\sigma > 0$ is a bandwidth parameter. To keep our calculations relatively
simple, let us focus here on the univariate case $d = 1$, and let $\mathcal{X}$ be some compact interval of
the real line. By a rescaling argument, we can restrict ourselves to the case $\mathcal{X} = [-1, 1]$, so
that we are considering solutions to the integral equation

\[ \int_{-1}^1 e^{-\frac{(x - z)^2}{2\sigma^2}} \phi_j(z)\,dz = \mu_j \phi_j(x). \tag{12.15} \]

Note that this problem cannot be tackled by the methods of the previous example, since we
are not performing the periodic extension of our function.² Nonetheless, the eigenvalues of
the Gaussian integral operator are very closely related to the Fourier transform.

² If we were to consider the periodically extended version, then the eigenvalues would be given by the cosine
coefficients $c_j = \int_{-1}^1 \exp\big(-\frac{u^2}{2\sigma^2}\big) \cos(\pi j u)\,du$, with the cosine functions as eigenfunctions.

In the remainder of our development, let us consider a slightly more general integral
equation. Given a bounded, continuous and even function $\Psi: \mathbb{R} \to [0, \infty)$, we may define its
(real-valued) Fourier transform $\psi(u) = \int_{-\infty}^{\infty} \Psi(\omega) e^{-i\omega u}\,d\omega$, and use it to define a
translation-invariant kernel via $K(x, z) := \psi(x - z)$. We are then led to the integral equation

\[ \int_{-1}^1 \psi(x - z) \phi_j(z)\,dz = \mu_j \phi_j(x). \tag{12.16} \]

Classical theory on integral operators can be used to characterize the spectrum of this
integral operator. More precisely, for any operator such that $\log \Psi(\omega) \asymp -\omega^\alpha$ for some $\alpha > 1$,
there is a constant $c$ such that the eigenvalues $(\mu_j)_{j=1}^\infty$ associated with the integral equation
(12.16) scale as $\mu_j \asymp e^{-c j \log j}$ as $j \to +\infty$. See the bibliographic section for further discussion
of results of this type.

The Gaussian kernel is a special case of this set-up with the pair $\Psi(\omega) = \exp\big(-\frac{\sigma^2 \omega^2}{2}\big)$ and
$\psi(u) = \exp\big(-\frac{u^2}{2\sigma^2}\big)$. Applying the previous reasoning guarantees that the eigenvalues of the
Gaussian kernel over a compact interval scale as $\mu_j \asymp \exp(-c j \log j)$ as $j \to +\infty$. We
thus see that the Gaussian kernel class is relatively small, since its eigenvalues decay at
exponential rate. (The reader should contrast this fast decay with the significantly slower
$\mu_j \asymp j^{-2}$ decay rate of the first-order Sobolev class from Example 12.23.) ♣
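The contrast between the two decay rates is easy to observe numerically. The following sketch approximates the leading eigenvalues of both integral operators by a midpoint discretization; the helper `leading_eigenvalues`, the grid size, and the bandwidth $\sigma = 1$ are illustrative assumptions rather than choices made in the text.

```python
import numpy as np

def leading_eigenvalues(kernel, a, b, m=400, k=10):
    """Top-k eigenvalues of the kernel integral operator on [a, b] (Lebesgue
    measure), approximated by a midpoint discretization with m cells."""
    z = a + (b - a) * (np.arange(m) + 0.5) / m
    T = kernel(z[:, None], z[None, :]) * (b - a) / m
    return np.sort(np.linalg.eigvalsh(T))[::-1][:k]

sigma = 1.0  # bandwidth; an arbitrary illustrative value

def gaussian_kernel(x, z):
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2))

gauss = leading_eigenvalues(gaussian_kernel, -1.0, 1.0)   # Gaussian kernel on [-1, 1]
sobolev = leading_eigenvalues(np.minimum, 0.0, 1.0)       # min{x, z} on [0, 1]

for j, (g, s) in enumerate(zip(gauss, sobolev), start=1):
    print(f"j={j:2d}  Gaussian mu_j ~ {g:.3e}   Sobolev mu_j ~ {s:.3e}")
```

Already at $j = 10$ the Gaussian eigenvalue is several orders of magnitude below the Sobolev eigenvalue. Eigenvalues near machine precision are dominated by round-off, so only the leading part of the computed Gaussian spectrum is meaningful.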


An interesting consequence of Mercer’s theorem is in giving a relatively explicit charac- 
terization of the RKHS associated with a given kernel. 


Corollary 12.26 Consider a kernel satisfying the conditions of Mercer’s theorem with
associated eigenfunctions $(\phi_j)_{j=1}^\infty$ and non-negative eigenvalues $(\mu_j)_{j=1}^\infty$. It induces the
reproducing kernel Hilbert space

\[ \mathbb{H} := \Big\{ f = \sum_{j=1}^\infty \beta_j \phi_j \;\Big|\; \text{for some } (\beta_j)_{j=1}^\infty \in \ell^2(\mathbb{N}) \text{ with } \sum_{j=1}^\infty \frac{\beta_j^2}{\mu_j} < \infty \Big\}, \tag{12.17a} \]

along with inner product

\[ \langle f, g \rangle_{\mathbb{H}} := \sum_{j=1}^\infty \frac{\langle f, \phi_j \rangle \langle g, \phi_j \rangle}{\mu_j}, \tag{12.17b} \]

where $\langle \cdot, \cdot \rangle$ denotes the inner product in $L^2(\mathcal{X}; \mathbb{P})$.


Let us make a few comments on this claim. First, in order to assuage any concerns regarding
division by zero, we can restrict all sums to only indices $j$ for which $\mu_j > 0$. Second, note
that Corollary 12.26 shows that the RKHS associated with a Mercer kernel is isomorphic to
an infinite-dimensional ellipsoid contained within $\ell^2(\mathbb{N})$, namely the set

\[ \mathcal{E} := \Big\{ (\beta_j)_{j=1}^\infty \in \ell^2(\mathbb{N}) \;\Big|\; \sum_{j=1}^\infty \frac{\beta_j^2}{\mu_j} \le 1 \Big\}. \tag{12.18} \]

We study the properties of such ellipsoids at more length in Chapters 13 and 14.


Proof For the proof, we take $\mu_j > 0$ for all $j \in \mathbb{N}$. This assumption entails no loss of
generality, since otherwise the same argument can be applied with relevant summations
truncated to the positive eigenvalues of the kernel function. Recall that $\langle \cdot, \cdot \rangle$ denotes the
inner product on $L^2(\mathcal{X}; \mathbb{P})$.

It is straightforward to verify that $\mathbb{H}$ along with the specified inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$ is a
Hilbert space. Our next step is to show that $\mathbb{H}$ is in fact a reproducing kernel Hilbert space,
and satisfies the reproducing property with respect to the given kernel. We begin by showing
that for each fixed $x \in \mathcal{X}$, the function $K(\cdot, x)$ belongs to $\mathbb{H}$. By the Mercer expansion,
we have $K(\cdot, x) = \sum_{j=1}^\infty \mu_j \phi_j(x) \phi_j(\cdot)$, so that by definition (12.17a) of our Hilbert space
(with coefficients $\beta_j = \mu_j \phi_j(x)$), it suffices to show that $\sum_{j=1}^\infty \mu_j \phi_j^2(x) < \infty$. By the Mercer
expansion, we have

\[ \sum_{j=1}^\infty \mu_j \phi_j^2(x) = K(x, x) < \infty, \]

so that $K(\cdot, x) \in \mathbb{H}$.

Let us now verify the reproducing property. By the orthonormality of $(\phi_j)_{j=1}^\infty$ in $L^2(\mathcal{X}; \mathbb{P})$
and Mercer’s theorem, we have $\langle K(\cdot, x), \phi_j \rangle = \mu_j \phi_j(x)$ for each $j \in \mathbb{N}$. Thus, by
definition (12.17b) of our Hilbert inner product, for any $f \in \mathbb{H}$, we have

\[ \langle f, K(\cdot, x) \rangle_{\mathbb{H}} = \sum_{j=1}^\infty \frac{\langle f, \phi_j \rangle \langle K(\cdot, x), \phi_j \rangle}{\mu_j}
   = \sum_{j=1}^\infty \langle f, \phi_j \rangle \phi_j(x) = f(x), \]

where the final step again uses the orthonormality of $(\phi_j)_{j=1}^\infty$. Thus, we have shown that $\mathbb{H}$ is
the RKHS with kernel $K$. (As discussed in Theorem 12.11, the RKHS associated with any
given kernel is unique.)


12.4 Operations on reproducing kernel Hilbert spaces 


In this section, we describe a number of operations on reproducing kernel Hilbert spaces 
that allow us to build new spaces. 


12.4.1 Sums of reproducing kernels 


Given two Hilbert spaces $\mathbb{H}_1$ and $\mathbb{H}_2$ of functions defined on domains $\mathcal{X}_1$ and $\mathcal{X}_2$,
respectively, consider the space

\[ \mathbb{H}_1 + \mathbb{H}_2 := \{ f_1 + f_2 \mid f_j \in \mathbb{H}_j, \; j = 1, 2 \}, \]

corresponding to the set of all functions obtained as sums of pairs of functions from the two
spaces.

Proposition 12.27 Suppose that $\mathbb{H}_1$ and $\mathbb{H}_2$ are both RKHSs with kernels $K_1$ and $K_2$,
respectively. Then the space $\mathbb{H} = \mathbb{H}_1 + \mathbb{H}_2$ with norm

\[ \|f\|^2_{\mathbb{H}} := \min_{\substack{f = f_1 + f_2 \\ f_1 \in \mathbb{H}_1, \; f_2 \in \mathbb{H}_2}} \big\{ \|f_1\|^2_{\mathbb{H}_1} + \|f_2\|^2_{\mathbb{H}_2} \big\} \tag{12.19} \]

is an RKHS with kernel $K = K_1 + K_2$.


Remark: This construction is particularly simple when $\mathbb{H}_1$ and $\mathbb{H}_2$ share only the constant
zero function, since any function $f \in \mathbb{H}$ can then be written as $f = f_1 + f_2$ for a unique pair
$(f_1, f_2)$, and hence $\|f\|^2_{\mathbb{H}} = \|f_1\|^2_{\mathbb{H}_1} + \|f_2\|^2_{\mathbb{H}_2}$. Let us illustrate the use of summation with some
examples:


Example 12.28 (First-order Sobolev space and constant functions) Consider the kernel
functions on $[0, 1] \times [0, 1]$ given by $K_1(x, z) = 1$ and $K_2(x, z) = \min\{x, z\}$. They generate the
reproducing kernel Hilbert spaces

\[ \mathbb{H}_1 = \operatorname{span}\{1\} \quad \text{and} \quad \mathbb{H}_2 = \mathbb{H}^1[0, 1], \]

where $\operatorname{span}\{1\}$ is the set of all constant functions, and $\mathbb{H}^1[0, 1]$ is the first-order Sobolev
space from Example 12.16. Note that $\mathbb{H}_1 \cap \mathbb{H}_2 = \{0\}$, since $f(0) = 0$ for any element of $\mathbb{H}_2$.
Consequently, the RKHS with kernel $K(x, z) = 1 + \min\{x, z\}$ consists of all functions

\[ \bar{\mathbb{H}}^1[0, 1] := \{ f: [0, 1] \to \mathbb{R} \mid f \text{ is absolutely continuous with } f' \in L^2[0, 1] \}, \]

equipped with the squared norm $\|f\|^2_{\bar{\mathbb{H}}^1[0,1]} = f^2(0) + \int_0^1 (f'(z))^2\,dz$. ♣


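As a continuation of the previous example, let us describe an extension of the higher-order
Sobolev spaces from Example 12.17:

Example 12.29 (Extending higher-order Sobolev spaces) For an integer $\alpha \ge 1$, consider
the kernel functions on $[0, 1] \times [0, 1]$ given by

\[ K_1(x, z) = \sum_{\ell=0}^{\alpha-1} \frac{x^\ell}{\ell!} \frac{z^\ell}{\ell!} \quad \text{and} \quad
   K_2(x, z) = \int_0^1 \frac{(x - y)_+^{\alpha-1}}{(\alpha-1)!} \frac{(z - y)_+^{\alpha-1}}{(\alpha-1)!}\,dy. \]

The first kernel generates an RKHS $\mathbb{H}_1$ of polynomials of degree $\alpha - 1$, whereas the second
kernel generates the $\alpha$-order Sobolev space $\mathbb{H}_2 = \mathbb{H}^\alpha[0, 1]$ previously defined in
Example 12.17.

Letting $f^{(\ell)}$ denote the $\ell$th-order derivative, recall that any function $f \in \mathbb{H}^\alpha[0, 1]$ satisfies
the boundary conditions $f^{(\ell)}(0) = 0$ for $\ell = 0, 1, \ldots, \alpha - 1$. Consequently, we have $\mathbb{H}_1 \cap \mathbb{H}_2 =
\{0\}$ so that Proposition 12.27 guarantees that the kernel

\[ K(x, z) = \sum_{\ell=0}^{\alpha-1} \frac{x^\ell}{\ell!} \frac{z^\ell}{\ell!} + \int_0^1 \frac{(x - y)_+^{\alpha-1}}{(\alpha-1)!} \frac{(z - y)_+^{\alpha-1}}{(\alpha-1)!}\,dy \tag{12.20} \]

generates the Hilbert space $\bar{\mathbb{H}}^\alpha[0, 1]$ of all functions that are $\alpha$-times differentiable almost
everywhere, with $f^{(\alpha)}$ Lebesgue-integrable. As we verify in Exercise 12.15, the associated
RKHS norm takes the form

\[ \|f\|^2_{\mathbb{H}} = \sum_{\ell=0}^{\alpha-1} \big( f^{(\ell)}(0) \big)^2 + \int_0^1 \big( f^{(\alpha)}(z) \big)^2\,dz. \tag{12.21} \]
♣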


Example 12.30 (Additive models) It is often convenient to build up a multivariate
function from simpler pieces, and additive models provide one way in which to do so. For
$j = 1, 2, \ldots, M$, let $\mathbb{H}_j$ be a reproducing kernel Hilbert space, and let us consider
functions that have an additive decomposition of the form $f = \sum_{j=1}^M f_j$, where $f_j \in \mathbb{H}_j$. By
Proposition 12.27, the space $\mathbb{H}$ of all such functions is itself an RKHS equipped with the
kernel function $K = \sum_{j=1}^M K_j$. A commonly used instance of such an additive model is
when the individual Hilbert space $\mathbb{H}_j$ corresponds to functions of the $j$th coordinate of a
$d$-dimensional vector, so that the space $\mathbb{H}$ consists of functions $f: \mathbb{R}^d \to \mathbb{R}$ that have the
additive decomposition

\[ f(x_1, \ldots, x_d) = \sum_{j=1}^d f_j(x_j), \]

where $f_j: \mathbb{R} \to \mathbb{R}$ is a univariate function for the $j$th coordinate. Since $\mathbb{H}_j \cap \mathbb{H}_k = \{0\}$ for
all $j \ne k$, the associated Hilbert norm takes the form $\|f\|^2_{\mathbb{H}} = \sum_{j=1}^d \|f_j\|^2_{\mathbb{H}_j}$. We provide some
additional discussion of these additive decompositions in Exercise 13.9 and Example 14.11
to follow in later chapters.

More generally, it is natural to consider expansions of the form

\[ f(x_1, \ldots, x_d) = \sum_{j=1}^d f_j(x_j) + \sum_{j \ne k} f_{jk}(x_j, x_k) + \cdots. \]

When the expansion functions are chosen to be mutually orthogonal, such expansions are
known as functional ANOVA decompositions. ♣

We now turn to the proof of Proposition 12.27. 


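Proof Consider the direct sum $\mathbb{F} := \mathbb{H}_1 \oplus \mathbb{H}_2$ of the two Hilbert spaces; by definition, it is
the Hilbert space $\{(f_1, f_2) \mid f_j \in \mathbb{H}_j, \; j = 1, 2\}$ of all ordered pairs, along with the norm

\[ \|(f_1, f_2)\|^2_{\mathbb{F}} := \|f_1\|^2_{\mathbb{H}_1} + \|f_2\|^2_{\mathbb{H}_2}. \tag{12.22} \]

Now consider the linear operator $L: \mathbb{F} \to \mathbb{H}$ defined by $(f_1, f_2) \mapsto f_1 + f_2$, and note that
it maps $\mathbb{F}$ onto $\mathbb{H}$. The nullspace $\mathbb{N}(L)$ of this operator is a subspace of $\mathbb{F}$, and we claim
that it is closed. Consider some sequence $((f_n, -f_n))_{n=1}^\infty$ contained within the nullspace $\mathbb{N}(L)$
that converges to a point $(f, g) \in \mathbb{F}$. By the definition of the norm (12.22), this convergence
implies that $f_n \to f$ in $\mathbb{H}_1$ (and hence pointwise) and $-f_n \to g$ in $\mathbb{H}_2$ (and hence pointwise).
Overall, we conclude that $f = -g$, meaning $(f, g) \in \mathbb{N}(L)$.

Let $\mathbb{N}^\perp$ be the orthogonal complement of $\mathbb{N}(L)$ in $\mathbb{F}$, and let $L_\perp$ be the restriction of $L$ to
$\mathbb{N}^\perp$. Since this map is a bijection between $\mathbb{N}^\perp$ and $\mathbb{H}$, we may define an inner product on $\mathbb{H}$
via

\[ \langle f, g \rangle_{\mathbb{H}} := \langle L_\perp^{-1}(f), L_\perp^{-1}(g) \rangle_{\mathbb{F}}. \]

It can be verified that the space $\mathbb{H}$ with this inner product is a Hilbert space.

It remains to check that $\mathbb{H}$ is an RKHS with kernel $K = K_1 + K_2$, and that the norm
$\|\cdot\|^2_{\mathbb{H}}$ takes the given form (12.19). Since the functions $K_1(\cdot, x)$ and $K_2(\cdot, x)$ belong to $\mathbb{H}_1$
and $\mathbb{H}_2$, respectively, the function $K(\cdot, x) = K_1(\cdot, x) + K_2(\cdot, x)$ belongs to $\mathbb{H}$. For a fixed
$f \in \mathbb{H}$, let $(f_1, f_2) = L_\perp^{-1}(f) \in \mathbb{F}$, and for a fixed $x \in \mathcal{X}$, let $(g_1, g_2) = L_\perp^{-1}(K(\cdot, x)) \in \mathbb{F}$.
Since $(g_1 - K_1(\cdot, x), \; g_2 - K_2(\cdot, x))$ must belong to $\mathbb{N}(L)$, it must be orthogonal (in $\mathbb{F}$) to the
element $(f_1, f_2) \in \mathbb{N}^\perp$. Consequently, we have $\langle (g_1 - K_1(\cdot, x), \; g_2 - K_2(\cdot, x)), \; (f_1, f_2) \rangle_{\mathbb{F}} = 0$,
and hence

\[ \langle f_1, K_1(\cdot, x) \rangle_{\mathbb{H}_1} + \langle f_2, K_2(\cdot, x) \rangle_{\mathbb{H}_2} = \langle f_1, g_1 \rangle_{\mathbb{H}_1} + \langle f_2, g_2 \rangle_{\mathbb{H}_2}
   = \langle f, K(\cdot, x) \rangle_{\mathbb{H}}. \]

Since $\langle f_1, K_1(\cdot, x) \rangle_{\mathbb{H}_1} + \langle f_2, K_2(\cdot, x) \rangle_{\mathbb{H}_2} = f_1(x) + f_2(x) = f(x)$, we have established that $K$ has
the reproducing property.

Finally, let us verify that the norm $\|f\|_{\mathbb{H}} := \|L_\perp^{-1}(f)\|_{\mathbb{F}}$ that we have defined is equivalent to
the definition (12.19). For a given $f \in \mathbb{H}$, consider some pair $(f_1, f_2) \in \mathbb{F}$ such that $f = f_1 + f_2$,
and define $(v_1, v_2) = (f_1, f_2) - L_\perp^{-1}(f)$. We have

\[ \|f_1\|^2_{\mathbb{H}_1} + \|f_2\|^2_{\mathbb{H}_2} \overset{(i)}{=} \|(f_1, f_2)\|^2_{\mathbb{F}}
   \overset{(ii)}{=} \|(v_1, v_2)\|^2_{\mathbb{F}} + \|L_\perp^{-1}(f)\|^2_{\mathbb{F}}
   \overset{(iii)}{=} \|(v_1, v_2)\|^2_{\mathbb{F}} + \|f\|^2_{\mathbb{H}}, \]

where step (i) uses the definition (12.22) of the norm in $\mathbb{F}$, step (ii) follows from the
Pythagorean property, as applied to the pair $(v_1, v_2) \in \mathbb{N}(L)$ and $L_\perp^{-1}(f) \in \mathbb{N}^\perp$, and step (iii) uses our
definition of the norm $\|f\|_{\mathbb{H}}$. Consequently, we have shown that for any pair $f_1, f_2$ such that
$f = f_1 + f_2$, we have

\[ \|f\|^2_{\mathbb{H}} \le \|f_1\|^2_{\mathbb{H}_1} + \|f_2\|^2_{\mathbb{H}_2}, \]

with equality holding if and only if $(v_1, v_2) = (0, 0)$, or equivalently $(f_1, f_2) = L_\perp^{-1}(f)$. This
establishes the equivalence of the definitions.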


12.4.2 Tensor products 


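Consider two separable Hilbert spaces $\mathbb{H}_1$ and $\mathbb{H}_2$ of functions, say with domains $\mathcal{X}_1$ and $\mathcal{X}_2$,
respectively. They can be used to define a new Hilbert space, denoted by $\mathbb{H}_1 \otimes \mathbb{H}_2$, known
as the tensor product of $\mathbb{H}_1$ and $\mathbb{H}_2$. Consider the set of functions $h: \mathcal{X}_1 \times \mathcal{X}_2 \to \mathbb{R}$ that have
the form

\[ \Big\{ h = \sum_{j=1}^n f_j g_j \;\Big|\; \text{for some } n \in \mathbb{N} \text{ and such that } f_j \in \mathbb{H}_1, \; g_j \in \mathbb{H}_2 \text{ for all } j \in [n] \Big\}. \]

If $h = \sum_{j=1}^n f_j g_j$ and $\tilde h = \sum_{k=1}^m \tilde f_k \tilde g_k$ are two members of this set, we define their inner product

\[ \langle h, \tilde h \rangle_{\mathbb{H}} := \sum_{j=1}^n \sum_{k=1}^m \langle f_j, \tilde f_k \rangle_{\mathbb{H}_1} \langle g_j, \tilde g_k \rangle_{\mathbb{H}_2}. \tag{12.23} \]

Note that the value of the inner product depends neither on the chosen representation of $h$
nor on that of $\tilde h$; indeed, using linearity of the inner product, we have

\[ \langle h, \tilde h \rangle_{\mathbb{H}} = \sum_{k=1}^m \langle (h \odot \tilde f_k), \tilde g_k \rangle_{\mathbb{H}_2}, \]

where $(h \odot \tilde f_k) \in \mathbb{H}_2$ is the function given by $x_2 \mapsto \langle h(\cdot, x_2), \tilde f_k \rangle_{\mathbb{H}_1}$. A similar argument
shows that the inner product does not depend on the representation of $\tilde h$, so that the inner
product (12.23) is well defined.

It is straightforward to check that the inner product (12.23) is bilinear and symmetric, and
that $\langle h, h \rangle_{\mathbb{H}} = \|h\|^2_{\mathbb{H}} \ge 0$ for all $h \in \mathbb{H}$. It remains to check that $\|h\|_{\mathbb{H}} = 0$ if and only if
$h = 0$. Consider some $h \in \mathbb{H}$ with the representation $h = \sum_{j=1}^n f_j g_j$. Let $(\phi_j)_{j=1}^\infty$ and $(\psi_k)_{k=1}^\infty$
be complete orthonormal bases of $\mathbb{H}_1$ and $\mathbb{H}_2$, respectively, ordered such that

\[ \operatorname{span}\{f_1, \ldots, f_n\} \subseteq \operatorname{span}\{\phi_1, \ldots, \phi_n\} \quad \text{and} \quad \operatorname{span}\{g_1, \ldots, g_n\} \subseteq \operatorname{span}\{\psi_1, \ldots, \psi_n\}. \]

Consequently, we can write $h$ equivalently as the double summation $h = \sum_{j,k=1}^n \alpha_{j,k} \phi_j \psi_k$ for
some set of real numbers $\{\alpha_{j,k}\}_{j,k=1}^n$. Using this representation, we are guaranteed the
equality $\|h\|^2_{\mathbb{H}} = \sum_{j=1}^n \sum_{k=1}^n \alpha_{j,k}^2$, which shows that $\|h\|_{\mathbb{H}} = 0$ if and only if $\alpha_{j,k} = 0$ for all $(j, k)$, or
equivalently $h = 0$.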


In this way, we have defined the tensor product H = H; ® H, of two Hilbert spaces. The 
next result asserts that when the two component spaces have reproducing kernels, then the 
tensor product space is also a reproducing kernel Hilbert space: 


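Proposition 12.31 Suppose that $\mathbb{H}_1$ and $\mathbb{H}_2$ are reproducing kernel Hilbert spaces of
real-valued functions with domains $\mathcal{X}_1$ and $\mathcal{X}_2$, and equipped with kernels $K_1$ and $K_2$,
respectively. Then the tensor product space $\mathbb{H} = \mathbb{H}_1 \otimes \mathbb{H}_2$ is an RKHS of real-valued
functions with domain $\mathcal{X}_1 \times \mathcal{X}_2$, and with kernel function

\[ K((x_1, x_2), (x_1', x_2')) = K_1(x_1, x_1')\, K_2(x_2, x_2'). \tag{12.24} \]


Proof In Exercise 12.16, it is shown that $K$ defined in equation (12.24) is a positive
semidefinite function. By definition of the tensor product space $\mathbb{H} = \mathbb{H}_1 \otimes \mathbb{H}_2$, for each pair
$(x_1, x_2) \in \mathcal{X}_1 \times \mathcal{X}_2$, the function $K((\cdot, \cdot), (x_1, x_2)) = K_1(\cdot, x_1) K_2(\cdot, x_2)$ is an element of the
tensor product space $\mathbb{H}$. Let $f = \sum_{j,k=1}^n \alpha_{j,k} \phi_j \psi_k$ be an arbitrary element of $\mathbb{H}$. By definition
of the inner product (12.23), we have

\[ \langle f, K((\cdot, \cdot), (x_1, x_2)) \rangle_{\mathbb{H}} = \sum_{j,k=1}^n \alpha_{j,k} \langle \phi_j, K_1(\cdot, x_1) \rangle_{\mathbb{H}_1} \langle \psi_k, K_2(\cdot, x_2) \rangle_{\mathbb{H}_2}
   = \sum_{j,k=1}^n \alpha_{j,k} \phi_j(x_1) \psi_k(x_2) = f(x_1, x_2), \]

thereby verifying the reproducing property.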
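For a finite collection of points, the Gram matrix of the product kernel (12.24) is the elementwise product of the two component Gram matrices, so its positive semidefiniteness can be checked directly. The sketch below uses the first-order Sobolev kernel and a Gaussian kernel as the two components; these choices, the sample size, and the random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x1 = rng.uniform(0.0, 1.0, size=n)     # first coordinates, in X_1 = [0, 1]
x2 = rng.uniform(-1.0, 1.0, size=n)    # second coordinates, in X_2 = [-1, 1]

# Component Gram matrices: first-order Sobolev kernel and Gaussian kernel.
G1 = np.minimum(x1[:, None], x1[None, :])
G2 = np.exp(-0.5 * (x2[:, None] - x2[None, :]) ** 2)

# Gram matrix of the product kernel at the points (x1_i, x2_i):
# entry (i, k) equals K1(x1_i, x1_k) * K2(x2_i, x2_k), an elementwise product.
G = G1 * G2

min_eigenvalue = np.linalg.eigvalsh(G).min()
print(f"smallest eigenvalue of the product Gram matrix: {min_eigenvalue:.3e}")
```

The smallest eigenvalue is non-negative up to round-off, consistent with the positive semidefiniteness established in Exercise 12.16.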

12.5 Interpolation and fitting


Reproducing kernel Hilbert spaces are useful for the classical problems of interpolating and 
fitting functions. An especially attractive property is the ease of computation: in particular, 
the representer theorem allows many optimization problems over the RKHS to be reduced 
to relatively simple calculations involving the kernel matrix. 


12.5.1 Function interpolation 


Let us begin with the problem of function interpolation. Suppose that we observe $n$ samples
of an unknown function $f^*: \mathcal{X} \to \mathbb{R}$, say of the form $y_i = f^*(x_i)$ for $i = 1, 2, \ldots, n$, where the
design sequence $\{x_i\}_{i=1}^n$ is known to us. Note that we are assuming for the moment that the
function values are observed without any noise or corruption. In this context, some questions
of interest include:

• For a given function class $\mathscr{F}$, does there exist a function $f \in \mathscr{F}$ that exactly fits the data,
  meaning that $f(x_i) = y_i$ for all $i = 1, 2, \ldots, n$?
• Of all functions in $\mathscr{F}$ that exactly fit the data, which does the “best” job of interpolating
  the data?


[Figure: two panels titled “Exact polynomial interpolation” (a) and “First-order spline interpolation” (b); horizontal axis: design value $x$; legend: fitted function, observed values.]

Figure 12.1 Exact interpolation of $n = 11$ equally sampled function values using
RKHS methods. (a) Polynomial kernel $K(x, z) = (1 + xz)^{12}$. (b) First-order Sobolev
kernel $K(x, z) = 1 + \min\{x, z\}$.


The first question can often be answered in a definitive way, in particular by producing a
function that exactly fits the data. The second question is vaguely posed and can be answered
in multiple ways, depending on our notion of “best”. In the context of a reproducing kernel
Hilbert space, the underlying norm provides a way of ordering functions, and so we are led
to the following formalization: of all the functions that exactly fit the data, choose the one
with minimal RKHS norm. This approach can be formulated as an optimization problem in
Hilbert space, namely,

\[ \text{choose } \hat f \in \arg\min_{f \in \mathbb{H}} \|f\|_{\mathbb{H}} \quad \text{such that } f(x_i) = y_i \text{ for } i = 1, 2, \ldots, n. \tag{12.25} \]

This method is known as minimal norm interpolation, and it is feasible whenever there
exists at least one function $f \in \mathbb{H}$ that fits the data exactly. We provide necessary and sufficient
conditions for such feasibility in the result to follow. Figure 12.1 illustrates this minimal
Hilbert norm interpolation method, using the polynomial kernel from Example 12.8 in
Figure 12.1(a), and the first-order Sobolev kernel from Example 12.23 in Figure 12.1(b).


For a general Hilbert space, the optimization problem (12.25) may not be well defined, or 
may be computationally challenging to solve. Hilbert spaces with reproducing kernels are 
attractive in this regard, as the computation can be reduced to simple linear algebra involving 
the kernel matrix K ∈ R^{n×n} with entries K_{ij} = K(x_i, x_j)/n. The following result provides
one instance of this general phenomenon: 


Proposition 12.32 Let K ∈ R^{n×n} be the kernel matrix defined by the design points
{x_i}_{i=1}^n. The convex program (12.25) is feasible if and only if y ∈ range(K), in which
case any optimal solution can be written as


    \hat{f}(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \hat{\alpha}_i K(\cdot, x_i), \quad \text{where } K\hat{\alpha} = y/\sqrt{n}.


Remark: Our choice of normalization by 1/√n is for later theoretical convenience.


Proof For a given vector α ∈ R^n, define the function f_α(·) := (1/√n) Σ_{i=1}^n α_i K(·, x_i), and
consider the set L := {f_α | α ∈ R^n}. Note that for any f_α ∈ L, we have


    f_\alpha(x_j) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \alpha_i K(x_j, x_i) = \sqrt{n} \, (K\alpha)_j,


where (Kα)_j is the jth component of the vector Kα ∈ R^n. Thus, the function f_α ∈ L satisfies
the interpolation condition if and only if Kα = y/√n. Consequently, the condition
y ∈ range(K) is sufficient. It remains to show that this range condition is necessary, and that
the optimal interpolating function must lie in L.

Note that L is a finite-dimensional (hence closed) linear subspace of H. Consequently,
any function f ∈ H can be decomposed uniquely as f = f_α + f_⊥, where f_α ∈ L and f_⊥ is
orthogonal to L. (See Exercise 12.3 for details of this direct sum decomposition.) Using this
decomposition and the reproducing property, we have


    f(x_j) = \langle f, K(\cdot, x_j) \rangle_{\mathbb{H}} = \langle f_\alpha + f_\perp, K(\cdot, x_j) \rangle_{\mathbb{H}} = f_\alpha(x_j),


where the final equality follows because K(·, x_j) belongs to L, and we have
⟨f_⊥, K(·, x_j)⟩_H = 0 due to the orthogonality of f_⊥ and L. Thus, the component f_⊥ has
no effect on the interpolation property, showing that the condition y ∈ range(K) is also a
necessary condition. Moreover, since f_α and f_⊥ are orthogonal, we are guaranteed to have
‖f_α + f_⊥‖²_H = ‖f_α‖²_H + ‖f_⊥‖²_H. Consequently, for any minimal Hilbert norm interpolant, we must have


    f_\perp = 0.
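
To make Proposition 12.32 concrete, the following sketch computes a minimal norm interpolant for the first-order Sobolev kernel K(x, z) = 1 + min{x, z}. The helper names (`solve`, `min_norm_interpolant`), the design points in [0, 1] and the target function sin(2πx) are assumptions chosen for illustration only. The code also assumes that the kernel matrix is invertible, so that the range condition y ∈ range(K) holds automatically.

```python
import math

def sobolev_kernel(x, z):
    """First-order Sobolev kernel K(x, z) = 1 + min{x, z}."""
    return 1.0 + min(x, z)

def solve(A, b):
    """Solve the square system A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u

def min_norm_interpolant(xs, ys, kernel):
    """Return (f_hat, alpha_hat) with K alpha_hat = y / sqrt(n), K_ij = kernel(x_i, x_j) / n."""
    n = len(xs)
    K = [[kernel(xi, xj) / n for xj in xs] for xi in xs]
    alpha = solve(K, [y / math.sqrt(n) for y in ys])

    def f_hat(x):
        # f_hat(x) = (1 / sqrt(n)) * sum_i alpha_i * K(x, x_i)
        return sum(a * kernel(x, xi) for a, xi in zip(alpha, xs)) / math.sqrt(n)

    return f_hat, alpha

# Hypothetical data: n = 11 equally spaced points and noiseless function values.
xs = [i / 10 for i in range(11)]
ys = [math.sin(2 * math.pi * x) for x in xs]
f_hat, alpha_hat = min_norm_interpolant(xs, ys, sobolev_kernel)
max_err = max(abs(f_hat(x) - y) for x, y in zip(xs, ys))
```

Up to floating-point error, `max_err` should be zero, since the fitted function reproduces every observed value. When K is singular, a least-squares or pseudo-inverse solve would be needed in place of `solve`.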


12.5.2 Fitting via kernel ridge regression


In a statistical setting, it is usually unrealistic to assume that we observe noiseless observa- 
tions of function values. Rather, it is more natural to consider a noisy observation model, 
say of the form 


    y_i = f^*(x_i) + w_i, \quad \text{for } i = 1, 2, \ldots, n,


where the coefficients {w_i}_{i=1}^n model noisiness or disturbance in the measurement model. In
the presence of noise, the exact constraints in our earlier interpolation method (12.25) are 
no longer appropriate; instead, it is more sensible to minimize some trade-off between the fit 
to the data and the Hilbert norm. For instance, we might only require that the mean-squared 
differences between the observed data and fitted values be small, which then leads to the 
optimization problem 


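    \min_{f \in \mathbb{H}} \|f\|_{\mathbb{H}} \quad \text{such that } \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 \le \delta^2,    (12.26)


where δ > 0 is some type of tolerance parameter. Alternatively, we might minimize the
mean-squared error subject to a bound on the Hilbert radius of the solution, say


    \min_{f \in \mathbb{H}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 \quad \text{such that } \|f\|_{\mathbb{H}} \le R    (12.27)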


for an appropriately chosen radius R > 0. Both of these problems are convex, and so by 
Lagrangian duality, they can be reformulated in the penalized form 


    \hat{f} = \arg\min_{f \in \mathbb{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda_n \|f\|_{\mathbb{H}}^2 \Big\}.    (12.28)


Here, for a fixed set of observations {(x_i, y_i)}_{i=1}^n, the regularization parameter λ_n ≥ 0 is a
function of the tolerance δ or radius R. This form of function estimate is most convenient
to implement, and in the case of a reproducing kernel Hilbert space considered here, it is
known as the kernel ridge regression estimate, or KRR estimate for short. The following result
shows how the KRR estimate is easily computed in terms of the kernel matrix K ∈ R^{n×n}
with entries K_{ij} = K(x_i, x_j)/n.


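Proposition 12.33 For all λ_n > 0, the kernel ridge regression estimate (12.28) can be
written as


    \hat{f}(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \hat{\alpha}_i K(\cdot, x_i),    (12.29)


where the optimal weight vector α̂ ∈ R^n is given by


    \hat{\alpha} = (K + \lambda_n I_n)^{-1} \frac{y}{\sqrt{n}}.    (12.30)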

Remarks: Note that Proposition 12.33 is a natural generalization of Proposition 12.32, to
which it reduces when λ_n = 0 (and the kernel matrix is invertible). Given the kernel matrix
K, computing α̂ via equation (12.30) requires at most O(n³) operations, using standard routines
in numerical linear algebra (see the bibliography for more details). Assuming that the
kernel function can be evaluated in constant time, computing the n × n matrix requires an
additional O(n²) operations. See Figure 12.2 for some illustrative examples.


[Figure 12.2: two panels, (a) Polynomial KRR and (b) First-order spline KRR.]

Figure 12.2 Illustration of kernel ridge regression estimates of function f*(x) = 3x/2 − 9x²/5
based on n = 11 samples, located at design points x_i = −0.5 + 0.10 (i − 1)
over the interval [−0.5, 0.5]. (a) Kernel ridge regression estimate using the second-order
polynomial kernel K(x, z) = (1 + xz)² and regularization parameter λ_n = 0.10.
(b) Kernel ridge regression estimate using the first-order Sobolev kernel K(x, z) =
1 + min{x, z} and regularization parameter λ_n = 0.10.


We now turn to the proof of Proposition 12.33. 


Proof Recall the argument of Proposition 12.32, and the decomposition f = f_α + f_⊥. Since
f_⊥(x_i) = 0 for all i = 1, 2, ..., n, it can have no effect on the least-squares data component
of the objective function (12.28). Consequently, following a similar line of reasoning to the 
proof of Proposition 12.32, we again see that any optimal solution must be of the specified 
form (12.29). 

It remains to prove the specific form (12.30) of the optimal α̂. Given a function f of the
form (12.29), for each j = 1, 2, ..., n, we have


    f(x_j) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \alpha_i K(x_j, x_i) = \sqrt{n} \, e_j^T K \alpha,


where e_j ∈ R^n is the canonical basis vector with 1 in position j, and we have recalled that
K_{ji} = K(x_j, x_i)/n. Similarly, we have the representation


    \|f\|_{\mathbb{H}}^2 = \frac{1}{n} \Big\langle \sum_{i=1}^n \alpha_i K(\cdot, x_i), \; \sum_{j=1}^n \alpha_j K(\cdot, x_j) \Big\rangle_{\mathbb{H}} = \alpha^T K \alpha.

Substituting these relations into the cost function, we find that it is a quadratic in the vector
α, given by


    \frac{1}{n} \|y - \sqrt{n} K\alpha\|_2^2 + \lambda_n \alpha^T K \alpha = \frac{1}{n} \|y\|_2^2 + \alpha^T (K^2 + \lambda_n K) \alpha - \frac{2}{\sqrt{n}} y^T K \alpha.


In order to find the minimum of this quadratic function, we compute the gradient and set it
equal to zero, thereby obtaining the stationary condition


    K (K + \lambda_n I_n) \alpha = K \frac{y}{\sqrt{n}}.


Thus, we see that the vector α̂ previously defined in equation (12.30) is optimal. Note that
any vector β ∈ R^n such that Kβ = 0 has no effect on the optimal solution.
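
The closed form (12.30) can be checked numerically. The sketch below fits a KRR estimate with the first-order Sobolev kernel and evaluates the quadratic cost from the proof, so that the computed weight vector can be compared against perturbed ones. The function names, the regression function sin(3x) and the noise level 0.1 are assumptions made for this illustration; they are not taken from the text.

```python
import math
import random

def sobolev_kernel(x, z):
    """First-order Sobolev kernel K(x, z) = 1 + min{x, z}."""
    return 1.0 + min(x, z)

def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u

def krr_fit(xs, ys, kernel, lam):
    """Return (f_hat, alpha_hat, K) with alpha_hat = (K + lam * I)^{-1} y / sqrt(n)."""
    n = len(xs)
    K = [[kernel(xi, xj) / n for xj in xs] for xi in xs]
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve(A, [y / math.sqrt(n) for y in ys])

    def f_hat(x):
        return sum(a * kernel(x, xi) for a, xi in zip(alpha, xs)) / math.sqrt(n)

    return f_hat, alpha, K

def objective(alpha, K, ys, lam):
    """Quadratic cost (1/n) * ||y - sqrt(n) K alpha||^2 + lam * alpha^T K alpha."""
    n = len(ys)
    K_alpha = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    fit = sum((ys[i] - math.sqrt(n) * K_alpha[i]) ** 2 for i in range(n)) / n
    penalty = lam * sum(alpha[i] * K_alpha[i] for i in range(n))
    return fit + penalty

# Hypothetical noisy data on the design x_i = -0.5 + 0.10 * (i - 1).
rng = random.Random(0)
xs = [-0.5 + 0.10 * i for i in range(11)]
ys = [math.sin(3 * x) + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
lam = 0.10
f_hat, alpha_hat, K = krr_fit(xs, ys, sobolev_kernel, lam)
best = objective(alpha_hat, K, ys, lam)
```

Because the cost is a convex quadratic in α, moving any single coordinate of `alpha_hat` should not decrease `objective`. The dense solve mirrors the O(n³) cost mentioned in the remarks.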


We return in Chapter 13 to study the statistical properties of the kernel ridge regression 
estimate. 


12.6 Distances between probability measures 


There are various settings in which it is important to construct distances between probability 
measures, and one way in which to do so is via measuring mean discrepancies over a given 
function class. More precisely, let P and Q be a pair of probability measures on a space X, 
and let F be a class of functions f: X → R that are integrable with respect to P and Q. We
can then define the quantity 


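    \rho_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \Big| \int f \, (dP - dQ) \Big| = \sup_{f \in \mathcal{F}} \big| \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)] \big|.    (12.31)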

It can be verified that, for any choice of function class F, this always defines a pseudometric,
meaning that ρ_F satisfies all the metric properties, except that there may exist pairs P ≠ Q
such that ρ_F(P, Q) = 0. When F is sufficiently rich, then ρ_F becomes a metric, known as
an integral probability metric. Let us provide some classical examples to illustrate:


Example 12.34 (Kolmogorov metric) Suppose that P and Q are measures on the real line.
For each t ∈ R, let 1_{(−∞, t]} denote the {0, 1}-valued indicator function for the event {x ≤ t}, and
consider the function class F = {1_{(−∞, t]} | t ∈ R}. We then have


    \rho_{\mathcal{F}}(P, Q) = \sup_{t \in \mathbb{R}} \big| P(X \le t) - Q(X \le t) \big| = \|F_P - F_Q\|_\infty,


where F_P and F_Q are the cumulative distribution functions of P and Q, respectively. Thus,
this choice leads to the Kolmogorov distance between P and Q. ♣
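
For two empirical distributions, the Kolmogorov distance reduces to a finite maximum, because both CDFs are right-continuous step functions that jump only at sample points. The following sketch illustrates this; the function name `kolmogorov_distance` and the sample values are assumptions for the example.

```python
def kolmogorov_distance(xs, zs):
    """sup_t |F_P(t) - F_Q(t)| for the empirical measures of the samples xs and zs."""
    def cdf(sample, t):
        # Empirical CDF: fraction of sample points that are <= t.
        return sum(1 for s in sample if s <= t) / len(sample)

    # The difference of the two step functions changes only at sample points,
    # so it is enough to evaluate it there.
    return max(abs(cdf(xs, t) - cdf(zs, t)) for t in set(xs) | set(zs))

# Hypothetical samples.
d_shifted = kolmogorov_distance([0.1, 0.4, 0.7], [0.2, 0.5, 0.9])
d_same = kolmogorov_distance([0.1, 0.4, 0.7], [0.7, 0.1, 0.4])
d_disjoint = kolmogorov_distance([0.0, 1.0], [2.0, 3.0])
```

Samples with disjoint, ordered supports attain the largest possible value of one, while reordering a sample leaves the distance at zero.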



Example 12.35 (Total variation distance) Consider the class F = {f: X → R | ‖f‖_∞ ≤ 1}
of real-valued functions bounded by one in the supremum norm. With this choice, we have


    \rho_{\mathcal{F}}(P, Q) = \sup_{\|f\|_\infty \le 1} \Big| \int f \, (dP - dQ) \Big|.

As we show in Exercise 12.17, this metric corresponds to (two times) the total variation
distance


    \|P - Q\|_1 = \sup_{A \subset \mathcal{X}} |P(A) - Q(A)|,


where the supremum ranges over all measurable subsets of X. ♣
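
On a finite space, the factor of two can be verified by brute force. In the sketch below, the supremum over functions bounded by one is attained by f = sign(p − q), which gives the ℓ1 distance between the probability vectors, while the supremum over events is computed by enumerating all subsets. The function names and the two probability vectors are assumptions for illustration.

```python
from itertools import combinations

def ipm_sup_norm_ball(p, q):
    """sup over |f| <= 1 of |sum_i f_i (p_i - q_i)|, attained at f = sign(p - q)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def tv_by_subsets(p, q):
    """sup over all subsets A of |P(A) - Q(A)|, by exhaustive enumeration."""
    indices = range(len(p))
    best = 0.0
    for size in range(len(p) + 1):
        for subset in combinations(indices, size):
            best = max(best, abs(sum(p[i] - q[i] for i in subset)))
    return best

# Hypothetical distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
ipm_value = ipm_sup_norm_ball(p, q)
tv_value = tv_by_subsets(p, q)
```

The enumeration is exponential in the size of the space, so it is meant only as a check on small examples.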


When we choose F to be the unit ball of an RKHS, we obtain a mean discrepancy
pseudometric that is easy to compute. In particular, given an RKHS with kernel function K,
consider the associated pseudometric


    \rho_{\mathbb{H}}(P, Q) := \sup_{\|f\|_{\mathbb{H}} \le 1} \big| \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)] \big|.


As verified in Exercise 12.18, the reproducing property allows us to obtain a simple closed-form
expression for this pseudometric, namely


    \rho_{\mathbb{H}}^2(P, Q) = \mathbb{E}\big[ K(X, X') + K(Z, Z') - 2K(X, Z) \big],    (12.32)


where X, X′ ~ P and Z, Z′ ~ Q are all mutually independent random vectors. We refer to
this pseudometric as a kernel means discrepancy, or KMD for short.


Example 12.36 (KMD for linear and polynomial kernels) Let us compute the KMD for
the linear kernel K(x, z) = ⟨x, z⟩ on R^d. Letting P and Q be two distributions on R^d with
mean vectors μ_p = E_P[X] and μ_q = E_Q[Z], respectively, we have


    \rho_{\mathbb{H}}^2(P, Q) = \mathbb{E}\big[ \langle X, X' \rangle + \langle Z, Z' \rangle - 2 \langle X, Z \rangle \big]
                              = \|\mu_p\|_2^2 + \|\mu_q\|_2^2 - 2 \langle \mu_p, \mu_q \rangle
                              = \|\mu_p - \mu_q\|_2^2.


Thus, we see that the KMD pseudometric for the linear kernel simply computes the Euclidean
distance of the associated mean vectors. This fact demonstrates that KMD in this
very special case is not actually a metric (but rather just a pseudometric), since ρ_H(P, Q) = 0
for any pair of distributions with the same means (i.e., μ_p = μ_q).

Moving onto polynomial kernels, let us consider the homogeneous polynomial kernel of
degree two, namely K(x, z) = ⟨x, z⟩². For this choice of kernel, we have


    \mathbb{E}[K(X, X')] = \mathbb{E}\Big[ \Big( \sum_{j=1}^d X_j X'_j \Big)^2 \Big] = \sum_{i,j=1}^d \mathbb{E}[X_i X_j] \, \mathbb{E}[X'_i X'_j] = \|\Gamma_p\|_F^2,


where Γ_p ∈ R^{d×d} is the second-order moment matrix with entries [Γ_p]_{ij} = E[X_i X_j], and the
squared Frobenius norm corresponds to the sum of the squared matrix entries. Similarly,
we have E[K(Z, Z′)] = ‖Γ_q‖²_F, where Γ_q is the second-order moment matrix for Q. Finally,
similar calculations yield that


    \mathbb{E}[K(X, Z)] = \sum_{i,j=1}^d [\Gamma_p]_{ij} [\Gamma_q]_{ij} = \langle\langle \Gamma_p, \Gamma_q \rangle\rangle,


where ⟨⟨·, ·⟩⟩ denotes the trace inner product between symmetric matrices. Putting together
the pieces, we conclude that, for the homogeneous second-order polynomial kernel, we have


    \rho_{\mathbb{H}}^2(P, Q) = \|\Gamma_p - \Gamma_q\|_F^2.    ♣
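
The closed-form expression (12.32) is straightforward to evaluate when P and Q are empirical measures, since each expectation becomes a double average over sample pairs. The sketch below does this for the linear and homogeneous quadratic kernels; the function names and the sample points are assumptions chosen so that the results can be checked by hand.

```python
def kmd_squared(xs, zs, kernel):
    """E[K(X, X') + K(Z, Z') - 2 K(X, Z)] when P and Q are the empirical
    measures of xs and zs, with X, X' (and Z, Z') drawn independently."""
    def mean_kernel(a, b):
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(zs, zs) - 2.0 * mean_kernel(xs, zs)

def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def quadratic_kernel(x, z):
    return linear_kernel(x, z) ** 2

# Hypothetical samples in R^2 with means (1, 1) and (2, 1).
xs = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
zs = [(1.0, 1.0), (3.0, 1.0)]
kmd_linear = kmd_squared(xs, zs, linear_kernel)

# Two distributions on R with equal means but different second moments.
us = [(-1.0,), (1.0,)]
vs = [(0.0,)]
kmd_equal_means_linear = kmd_squared(us, vs, linear_kernel)
kmd_equal_means_quadratic = kmd_squared(us, vs, quadratic_kernel)
```

In the second pair of samples the means coincide, so the linear kernel cannot separate the two distributions, whereas the quadratic kernel detects the difference in second moments.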


Example 12.37 (KMD for a first-order Sobolev kernel) Let us now consider the KMD induced
by the kernel function K(x, z) = min{x, z}, defined on the Cartesian product
[0, 1] × [0, 1]. As seen previously in Example 12.16, this kernel function generates the
first-order Sobolev space


    \mathbb{H}^1[0, 1] = \Big\{ f: [0, 1] \to \mathbb{R} \mid f(0) = 0 \text{ and } \int_0^1 (f'(x))^2 \, dx < \infty \Big\},


with Hilbert norm ‖f‖²_{H¹[0,1]} = ∫_0^1 (f′(x))² dx. With this choice, we have


    \rho_{\mathbb{H}}^2(P, Q) = \mathbb{E}\big[ \min\{X, X'\} + \min\{Z, Z'\} - 2 \min\{X, Z\} \big].    ♣


12.7 Bibliographic details and background 


The notion of a reproducing kernel Hilbert space emerged from the study of positive semi- 
definite kernels and their links to Hilbert space structure. The seminal paper by Aron- 
szajn (1950) develops a number of the basic properties from first principles, including 
Propositions 12.27 and 12.31 as well as Theorem 12.11 from this chapter. The use of the 
kernel trick for computing inner products via kernel evaluations dates back to Aizerman et 
al. (1964), and underlies the success of the support vector machine developed by Boser et 
al. (1992), and discussed in Exercise 12.20. The book by Wahba (1990) contains a wealth 
of information on RKHSs, as well as the connections between splines and penalized meth- 
ods for regression. See also the books by Berlinet and Thomas-Agnan (2004) as well as 
Gu (2002). The book by Schölkopf and Smola (2002) provides a number of applications
of kernels in the setting of machine learning, including the support vector machine (Ex- 
ercise 12.20) and related methods for classification, as well as kernel principal components 
analysis. The book by Steinwart and Christmann (2008) also contains a variety of theoretical 
results on kernels and reproducing kernel Hilbert spaces. 

The argument underlying the proofs of Propositions 12.32 and 12.33 is known as the 
representer theorem, and is due to Kimeldorf and Wahba (1971). From the computational 
point of view, it is extremely important, since it allows the infinite-dimensional problem of 
optimizing over an RKHS to be reduced to an n-dimensional convex program. Bochner’s 
theorem relates the positive semidefiniteness of kernel functions to the non-negativity of 
Fourier coefficients. In its classical formulation, it applies to the Fourier transform over 
R, but it can be generalized to all locally compact Abelian groups (Rudin, 1990). The 
results used to compute the asymptotic scaling of the eigenvalues of the Gaussian kernel in 
Example 12.25 are due to Widom (1963; 1964). 

There are a number of papers that study the approximation-theoretic properties of various 
types of reproducing kernel Hilbert spaces. For a given Hilbert space H and norm ‖·‖, such
results are often phrased in terms of the function


    A(f^*; R) := \inf_{\|f\|_{\mathbb{H}} \le R} \|f - f^*\|_2,    (12.33)

where ‖g‖_2 := (∫_X g²(x) dx)^{1/2} is the usual L²-norm on a compact space X. This function
measures how quickly the L²(X)-error in approximating some function f* decays as the
Hilbert radius R is increased. See the papers (Smale and Zhou, 2003; Zhou, 2013) for results
on this form of the approximation error. A reproducing kernel Hilbert space is said to be
L²(X)-universal if lim_{R→∞} A(f*; R) = 0 for any f* ∈ L²(X). There are also various other
forms of universality; see the book by Steinwart and Christmann (2008) for further details.

Integral probability metrics of the form (12.31) have been studied extensively (Müller,
1997; Rachev et al., 2013). RKHS-based distances in particular are computationally
convenient, and have been studied in the context of proper scoring rules (Dawid, 2007;
Gneiting and Raftery, 2007) and two-sample testing (Borgwardt et al., 2006; Gretton et al.,
2012).


12.8 Exercises 


Exercise 12.1 (Closedness of nullspace) Let L be a bounded linear functional on a Hilbert
space H. Show that the subspace null(L) = {f ∈ H | L(f) = 0} is closed.


Exercise 12.2 (Projections in a Hilbert space) Let G be a closed convex subset of a Hilbert
space H. In this exercise, we show that for any f ∈ H, there exists a unique ĝ ∈ G such that


    \|\hat{g} - f\|_{\mathbb{H}} = \underbrace{\inf_{g \in G} \|g - f\|_{\mathbb{H}}}_{p^*}.


This element ĝ is known as the projection of f onto G.


(a) By the definition of infimum, there exists a sequence (g_n)_{n=1}^∞ contained in G such that
‖g_n − f‖_H → p*. Show that this sequence is a Cauchy sequence. (Hint: First show that
‖f − (g_n + g_m)/2‖_H converges to p*.)

(b) Use this Cauchy sequence to establish the existence of g. 

(c) Show that the projection must be unique. 


(d) Does the same claim hold for an arbitrary convex set G? 


Exercise 12.3 (Direct sum decomposition in Hilbert space) Let H be a Hilbert space, and 
let G be a closed linear subspace of H. Show that any f ∈ H can be decomposed uniquely as
g + g^⊥, where g ∈ G and g^⊥ ∈ G^⊥. In brief, we say that H has the direct sum decomposition
G ⊕ G^⊥. (Hint: The notion of a projection onto a closed convex set from Exercise 12.2 could
be helpful to you.)


Exercise 12.4 (Uniqueness of kernel) Show that the kernel function associated with any 
reproducing kernel Hilbert space must be unique. 


Exercise 12.5 (Kernels and Cauchy–Schwarz)

(a) For any positive semidefinite kernel K: X × X → R, prove that


    K(x, z) \le \sqrt{K(x, x) \, K(z, z)} \quad \text{for all } x, z \in \mathcal{X}.


(b) Show how the classical Cauchy—Schwarz inequality is a special case. 


Exercise 12.6 (Eigenfunctions for linear kernels) Consider the ordinary linear kernel
K(x, z) = ⟨x, z⟩ on R^d equipped with a probability measure P. Assuming that a random
vector X ~ P has all its second moments finite, show how to compute the eigenfunctions of
the associated kernel operator acting on L²(X; P) in terms of linear algebraic operations.

Exercise 12.7 (Different kernels for polynomial functions) For an integer m > 1, consider 

the kernel functions K, (x, z) = (1 + xz)” and K(x, z) = Veto -a 

(a) Show that they are both PSD, and generate RKHSs of polynomial functions of degree 
at most m. 

(b) Why does this not contradict the result of Exercise 12.4? 


Exercise 12.8 True or false? If true, provide a short proof; if false, give an explicit counter- 
example. 


(a) Given two PSD kernels K_1 and K_2, the bivariate function K(x, z) = min_{j=1,2} K_j(x, z) is
also a PSD kernel.

(b) Let f: X → H be a function from an arbitrary space X to a Hilbert space H. The
bivariate function


    K(x, z) = \frac{\langle f(x), f(z) \rangle_{\mathbb{H}}}{\|f(x)\|_{\mathbb{H}} \, \|f(z)\|_{\mathbb{H}}}


defines a PSD kernel on X × X.


Exercise 12.9 (Left-right multiplication and kernels) Let K: X × X → R be a positive
semidefinite kernel, and let f: X → R be an arbitrary function. Show that K̃(x, z) =
f(x) K(x, z) f(z) is also a positive semidefinite kernel.


Exercise 12.10 (Kernels and power sets) Given a finite set S, its power set P(S) is the set of
all the subsets of S. Show that the function K: P(S) × P(S) → R given by K(A, B) = 2^{|A ∩ B|}
is a positive semidefinite kernel function.


Exercise 12.11 (Feature map for polynomial kernel) Recall from equation (12.14) the
notion of a feature map. Show that the polynomial kernel K(x, z) = (1 + ⟨x, z⟩)^m defined
on the Cartesian product space R^d × R^d can be realized by a feature map x ↦ Φ(x) ∈ R^D,
where D = \binom{d+m}{m}.


Exercise 12.12 (Probability spaces and kernels) Consider a probability space with events
E and probability law P. Show that the real-valued function


    K(A, B) := P[A \cap B] - P[A] \, P[B]


is a positive semidefinite kernel function on E × E.

Exercise 12.13 (From sets to power sets) Suppose that K: S × S → R is a symmetric PSD
kernel function on a finite set S. Show that


    K'(A, B) = \sum_{x \in A, \, z \in B} K(x, z)


is a symmetric PSD kernel on the power set P(S).


Exercise 12.14 (Kernel and function boundedness) Consider a PSD kernel K: X × X → R
such that K(x, z) ≤ b² for all x, z ∈ X. Show that ‖f‖_∞ ≤ b for any function f in the unit
ball of the associated RKHS.


Exercise 12.15 (Sobolev kernels and norms) Show that the Sobolev kernel defined in equa- 
tion (12.20) generates the norm given in equation (12.21). 


Exercise 12.16 (Hadamard products and kernel products) In this exercise, we explore prop- 
erties of product kernels and the Hadamard product of matrices. 


(a) Given two n × n matrices Γ and Σ that are symmetric and positive semidefinite, show
that the Hadamard product matrix Σ ⊙ Γ ∈ R^{n×n} is also positive semidefinite. (The
Hadamard product is simply the elementwise product; that is, (Σ ⊙ Γ)_{ij} = Σ_{ij} Γ_{ij} for all
i, j = 1, 2, ..., n.)

(b) Suppose that K_1 and K_2 are positive semidefinite kernel functions on X × X. Show that
the function K(x, z) := K_1(x, z) K_2(x, z) is a positive semidefinite kernel function. (Hint:
The result of part (a) could be helpful.)


Exercise 12.17 (Total variation norm) Given two probability measures P and Q on X, show 
that 


    \sup_{\|f\|_\infty \le 1} \Big| \int f \, (dP - dQ) \Big| = 2 \sup_{A \subset \mathcal{X}} |P(A) - Q(A)|,


where the left supremum ranges over all measurable functions f: X → R, and the right
supremum ranges over all measurable subsets A of X.


Exercise 12.18 (RKHS-induced semi-metrics) Let H be a reproducing kernel Hilbert space 
of functions with domain X, and let P and Q be two probability distributions on X. Show 
that 


    \sup_{\|f\|_{\mathbb{H}} \le 1} \big| \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)] \big|^2 = \mathbb{E}\big[ K(X, X') + K(Z, Z') - 2K(X, Z) \big],


where X, X’ ~ P and Z, Z’ ~ Q are jointly independent. 


Exercise 12.19 (Positive semidefiniteness of Gaussian kernel) Let X be a compact subset
of R^d. In this exercise, we work through a proof of the fact that the Gaussian kernel
K(x, z) = exp(−‖x − z‖²_2 / (2σ²)) on X × X is positive semidefinite.


(a) Let K be a PSD kernel, and let p be a polynomial with non-negative coefficients. Show
that K̃(x, z) = p(K(x, z)) is a PSD kernel.

(b) Show that the kernel K_1(x, z) = exp(⟨x, z⟩/σ²) is positive semidefinite. (Hint: Part (a) and the
fact that a pointwise limit of PSD kernels is also PSD could be useful.)

(c) Show that the Gaussian kernel is PSD. (Hint: The result of Exercise 12.9 could be useful.)


Exercise 12.20 (Support vector machines and kernel methods) In the problem of binary
classification, one observes a collection of pairs {(x_i, y_i)}_{i=1}^n, where each feature vector x_i ∈
R^d is associated with a label y_i ∈ {−1, +1}, and the goal is to derive a classification function
that can be applied to unlabelled feature vectors. In the context of reproducing kernel Hilbert
spaces, one way of doing so is by minimizing a criterion of the form


    \hat{f} = \arg\min_{f \in \mathbb{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - y_i f(x_i)\} + \frac{1}{2} \lambda_n \|f\|_{\mathbb{H}}^2 \Big\},    (12.34)


where H is a reproducing kernel Hilbert space, and λ_n > 0 is a user-defined regularization
parameter. The classification rule is then given by x ↦ sign(f̂(x)).


(a) Prove that f̂ can be written in the form f̂(·) = (1/√n) Σ_{i=1}^n α̂_i K(·, x_i), for some vector α̂ ∈ R^n.

(b) Use part (a) and duality theory to show that an optimal coefficient vector α̂ can be obtained
by solving the problem


= 1< lore 
we vema Ja — joke s.t. a; € [0, Tr] for alli = 1,...,n, 
i=l 


acR’ | n & 


and where K̃ ∈ R^{n×n} has entries K̃_{ij} := y_i y_j K(x_i, x_j)/n.


13 


Nonparametric least squares 


In this chapter, we consider the problem of nonparametric regression, in which the goal is 
to estimate a (possibly nonlinear) function on the basis of noisy observations. Using results 
developed in previous chapters, we analyze the convergence rates of procedures based on 
solving nonparametric versions of least-squares problems. 


13.1 Problem set-up 


A regression problem is defined by a set of predictors or covariates x € X, along with 
a response variable y € Y. Throughout this chapter, we focus on the case of real-valued 
response variables, in which the space Y is the real line or some subset thereof. Our goal is 
to estimate a function f: X → Y such that the error y − f(x) is as small as possible over
some range of pairs (x,y). In the random design version of regression, we model both the 
response and covariate as random quantities, in which case it is reasonable to measure the 
quality of f in terms of its mean-squared error (MSE) 


    \mathcal{L}_f := \mathbb{E}_{X,Y}\big[ (Y - f(X))^2 \big].    (13.1)


The function f* minimizing this criterion is known as the Bayes’ least-squares estimate or 
the regression function, and it is given by the conditional expectation 


    f^*(x) = \mathbb{E}[Y \mid X = x],    (13.2)


assuming that all relevant expectations exist. See Exercise 13.1 for further details. 

In practice, the expectation defining the MSE (13.1) cannot be computed, since the joint 
distribution over (X, Y) is not known. Instead, we are given a collection of samples {(x_i, y_i)}_{i=1}^n,
which can be used to compute an empirical analog of the mean-squared error, namely


    \hat{\mathcal{L}}_f := \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2.    (13.3)


The method of nonparametric least squares, to be discussed in detail in this chapter, is based 
on minimizing this least-squares criterion over some suitably controlled function class. 


13.1.1 Different measures of quality 


Given an estimate f̂ of the regression function, it is natural to measure its quality in terms
of the excess risk, namely the difference between the optimal MSE L_{f*} achieved by the
regression function f*, and that achieved by the estimate f̂. In the special case of the
least-squares cost function, it can be shown (see Exercise 13.1) that this excess risk takes the
form


    \mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*} = \underbrace{\mathbb{E}_X\big[ (\hat{f}(X) - f^*(X))^2 \big]}_{\|\hat{f} - f^*\|_{L^2(P)}^2},    (13.4)


where P denotes the distribution over the covariates. When this underlying distribution is
clear from the context, we frequently adopt the shorthand notation ‖f̂ − f*‖_2 for the
L²(P)-norm.

In this chapter, we measure the error using a closely related but slightly different measure,
one that is defined by the samples {x_i}_{i=1}^n of the covariates. In particular, they define the
empirical distribution P_n := (1/n) Σ_{i=1}^n δ_{x_i} that places a weight 1/n on each sample, and the
associated L²(P_n)-norm is given by


    \|\hat{f} - f^*\|_{L^2(P_n)} := \Big[ \frac{1}{n} \sum_{i=1}^n \big( \hat{f}(x_i) - f^*(x_i) \big)^2 \Big]^{1/2}.    (13.5)


In order to lighten notation, we frequently use ‖f̂ − f*‖_n as a shorthand for the more
cumbersome ‖f̂ − f*‖_{L²(P_n)}. Throughout the remainder of this chapter, we will view the samples
{x_i}_{i=1}^n as being fixed, a set-up known as regression with a fixed design. The theory in this
chapter focuses on error bounds in terms of the empirical L²(P_n)-norm. Results from
Chapter 14 to follow can be used to translate these bounds into equivalent results in the population
L²(P)-norm.
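
As a small illustration of definition (13.5), the sketch below evaluates the empirical norm of the difference between two functions at a fixed design. The function name `empirical_norm` and the example functions are assumptions for this illustration. The second example shows why this quantity is only a seminorm on functions: two functions that agree at every design point are at distance zero, even if they differ elsewhere.

```python
import math

def empirical_norm(f, g, xs):
    """L2(P_n) distance between f and g: sqrt((1/n) * sum_i (f(x_i) - g(x_i))^2)."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in xs) / len(xs))

# Hypothetical fixed design with n = 3 points.
xs = [-1.0, 0.0, 1.0]

def zero(x):
    return 0.0

# Distance between the identity function and the zero function.
d_identity = empirical_norm(lambda x: x, zero, xs)

# A cubic that vanishes at every design point but not in between.
d_cubic = empirical_norm(lambda x: x * (x - 1.0) * (x + 1.0), zero, xs)
```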


13.1.2 Estimation via constrained least squares 


Given a fixed collection {x_i}_{i=1}^n of design points, the associated response variables {y_i}_{i=1}^n
can always be written in the generative form


    y_i = f^*(x_i) + v_i, \quad \text{for } i = 1, 2, \ldots, n,    (13.6)


where v_i is a random variable representing the “noise” in the ith response variable. Note
that these noise variables must have zero mean, given the form (13.2) of the regression
function f*. Apart from this zero-mean property, their structure in general depends on the
distribution of the conditioned random variable (Y | X = x). In the standard nonparametric
regression model, we assume the noise variables are drawn in an i.i.d. manner from the
N(0, σ²) distribution, where σ > 0 is a standard deviation parameter. In this case, we can
write v_i = σ w_i, where w_i ~ N(0, 1) is a Gaussian random variable.

Given this set-up, one way in which to estimate the regression function f* is by constrained
least squares, that is, by solving the problem¹


    \hat{f} \in \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 \Big\},    (13.7)


where F is a suitably chosen subset of functions. When v_i ~ N(0, σ²), note that the estimate
defined by the criterion (13.7) is equivalent to the constrained maximum likelihood estimate.
However, as with least-squares regression in the parametric setting, the estimator is far more
generally applicable.

¹ Although the renormalization by n^{-1} in the definition (13.7) has no consequence on f̂, we do so in order to
emphasize the connection between this method and the L²(P_n)-norm.

Typically, we restrict the optimization problem (13.7) to some appropriately chosen subset
of F, for instance a ball of radius R in an underlying norm ‖·‖_F. Choosing F to be
a reproducing kernel Hilbert space, as discussed in Chapter 12, can be useful for computational
reasons. It can also be convenient to use regularized estimators of the form


    \hat{f} \in \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda_n \|f\|_{\mathcal{F}}^2 \Big\},    (13.8)


where λ_n > 0 is a suitably chosen regularization weight. We return to analyze such estimators
in Section 13.4.


13.1.3 Some examples 


Let us illustrate the estimators (13.7) and (13.8) with some examples. 


Example 13.1 (Linear regression) For a given vector θ ∈ R^d, define the linear function
f_θ(x) = ⟨θ, x⟩. Given a compact subset C ⊆ R^d, consider the function class


    \mathcal{F}_C := \{ f_\theta: \mathbb{R}^d \to \mathbb{R} \mid \theta \in C \}.


With this choice, the estimator (13.7) reduces to a constrained form of least-squares estimation,
more specifically


    \hat{\theta} \in \arg\min_{\theta \in C} \Big\{ \frac{1}{n} \|y - X\theta\|_2^2 \Big\},


where X ∈ R^{n×d} is the design matrix with the vector x_i ∈ R^d in its ith row. Particular instances
of this estimator include ridge regression, obtained by setting


    C = \{ \theta \in \mathbb{R}^d \mid \|\theta\|_2^2 \le R_2 \}

for some (squared) radius R_2 > 0. More generally, this class of estimators contains all the
constrained ℓ_q-ball estimators, obtained by setting


    C = \Big\{ \theta \in \mathbb{R}^d \mid \sum_{j=1}^d |\theta_j|^q \le R_q \Big\}

for some q ∈ [0, 2] and radius R_q > 0. See Figure 7.1 for an illustration of these sets for
q ∈ (0, 1]. The constrained form of the Lasso (7.19), as analyzed in depth in Chapter 7, is a
special but important case, obtained by setting q = 1.


Whereas the previous example was a parametric problem, we now turn to some nonpara- 
metric examples: 

Example 13.2 (Cubic smoothing spline) Consider the class of twice continuously differentiable
functions f: [0, 1] → R, and for a given squared radius R > 0, define the function
class


    \mathcal{F}(R) := \Big\{ f: [0, 1] \to \mathbb{R} \mid \int_0^1 (f''(x))^2 \, dx \le R \Big\},    (13.9)


where f″ denotes the second derivative of f. The integral constraint on f″ can be understood
as a Hilbert norm bound in the second-order Sobolev space H²[0, 1] introduced in
Example 12.17. In this case, the penalized form of the nonparametric least-squares estimate
is given by


    \hat{f} \in \arg\min_{f} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda_n \int_0^1 (f''(x))^2 \, dx \Big\},    (13.10)


where λ_n > 0 is a user-defined regularization parameter. It can be shown that any minimizer
f̂ is a cubic spline, meaning that it is a piecewise cubic function, with the third derivative
changing at each of the distinct design points x_i. In the limit as R → 0 (or equivalently, as
λ_n → +∞), the cubic spline fit f̂ becomes a linear function, since we have f″ = 0 only for a
linear function. ♣


The spline estimator in the previous example turns out to be a special case of a more gen- 
eral class of estimators, based on regularization in a reproducing kernel Hilbert space (see 
Chapter 12 for background). Let us consider this family more generally: 


Example 13.3 (Kernel ridge regression) Let H be a reproducing kernel Hilbert space,
equipped with the norm ‖·‖_H. Given some regularization parameter λ_n > 0, consider the
estimator


    \hat{f} \in \arg\min_{f \in \mathbb{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda_n \|f\|_{\mathbb{H}}^2 \Big\}.


As discussed in Chapter 12, the computation of this estimate can be reduced to solving a
quadratic program involving the empirical kernel matrix defined by the design points {x_i}_{i=1}^n.
In particular, if we define the kernel matrix with entries K_{ij} = K(x_i, x_j)/n, then the solution
takes the form f̂(·) = (1/√n) Σ_{i=1}^n α̂_i K(·, x_i), where α̂ := (K + λ_n I_n)^{-1} y/√n. In Exercise 13.3, we
show how the spline estimator from Example 13.2 can be understood in the context of kernel
ridge regression. ♣


Let us now consider an example of what is known as shape-constrained regression. 


Example 13.4 (Convex regression) Suppose that f*: C → R is known to be a convex
function over its domain C, some convex and open subset of R^d. In this case, it is natural to
consider the least-squares estimator with a convexity constraint, namely


    \hat{f} \in \arg\min_{f: C \to \mathbb{R}, \; f \text{ is convex}} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 \Big\}.

As stated, this optimization problem is infinite-dimensional in nature. Fortunately, by
exploiting the structure of convex functions, it can be converted to an equivalent
finite-dimensional problem. In particular, any convex function f is subdifferentiable at each point
in the (relative) interior of its domain C. More precisely, at any interior point x ∈ C, there
exists at least one vector z ∈ R^d such that


    f(y) \ge f(x) + \langle z, y - x \rangle \quad \text{for all } y \in C.    (13.11)


Any such vector is known as a subgradient, and each point x ∈ C can be associated with
the set ∂f(x) of its subgradients, which is known as the subdifferential of f at x. When f is
actually differentiable at x, then the lower bound (13.11) holds if and only if z = ∇f(x), so
that we have ∂f(x) = {∇f(x)}. See the bibliographic section for some standard references in
convex analysis.

Applying this fact to each of the sampled points $\{x_i\}_{i=1}^n$, we find that there must exist
subgradient vectors $\tilde{z}_i \in \mathbb{R}^d$ such that

$$f(x) \geq f(x_i) + \langle \tilde{z}_i, \, x - x_i \rangle \quad \text{for all } x \in C. \tag{13.12}$$

Since the cost function depends only on the values $\tilde{y}_i := f(x_i)$, the optimum does not
depend on the function behavior elsewhere. Consequently, it suffices to consider the collection
$\{(\tilde{y}_i, \tilde{z}_i)\}_{i=1}^n$ of function value and subgradient pairs, and solve the optimization problem

$$\min_{\{(\tilde{y}_i, \tilde{z}_i)\}_{i=1}^n} \frac{1}{2n} \sum_{i=1}^n (y_i - \tilde{y}_i)^2 \quad \text{such that } \tilde{y}_j \geq \tilde{y}_i + \langle \tilde{z}_i, \, x_j - x_i \rangle \text{ for all } i, j = 1, 2, \ldots, n. \tag{13.13}$$

Note that this is a convex program in $N = n(d + 1)$ variables, with a quadratic cost function
and a total of $2\binom{n}{2}$ linear constraints.


An optimal solution $\{(\hat{y}_i, \hat{z}_i)\}_{i=1}^n$ can be used to define the estimate $\hat{f}: C \to \mathbb{R}$ via

$$\hat{f}(x) := \max_{i = 1, \ldots, n} \big\{ \hat{y}_i + \langle \hat{z}_i, \, x - x_i \rangle \big\}. \tag{13.14}$$

As the maximum of a collection of linear functions, the function $\hat{f}$ is convex. Moreover, a
short calculation, using the fact that $\{(\hat{y}_i, \hat{z}_i)\}_{i=1}^n$ are feasible for the program (13.13), shows
that $\hat{f}(x_i) = \hat{y}_i$ for all $i = 1, 2, \ldots, n$. Figure 13.1(a) provides an illustration of the convex
regression estimate (13.14), showing its piecewise linear nature.
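The construction (13.14) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rather than solving the quadratic program, it uses hypothetical feasible pairs taken from the values and gradients of $f(x) = \|x\|_2^2$, and the helper names `max_affine` and `is_feasible` are not from the text.

```python
def max_affine(x, points, values, slopes):
    """Evaluate f_hat(x) = max_i { values[i] + <slopes[i], x - points[i]> }."""
    best = float("-inf")
    for xi, yi, zi in zip(points, values, slopes):
        affine = yi + sum(z * (a - b) for z, a, b in zip(zi, x, xi))
        best = max(best, affine)
    return best


def is_feasible(points, values, slopes, tol=1e-12):
    """Check the constraints values[j] >= values[i] + <slopes[i], x_j - x_i> for all i, j."""
    for xj, yj in zip(points, values):
        if yj < max_affine(xj, points, values, slopes) - tol:
            return False
    return True


# Hypothetical feasible pairs: values and gradients of f(x) = ||x||^2 in d = 2.
points = [(-1.0, 0.5), (0.0, 0.0), (0.5, -0.25), (1.0, 1.0)]
values = [sum(c * c for c in p) for p in points]
slopes = [tuple(2.0 * c for c in p) for p in points]

assert is_feasible(points, values, slopes)
# The max-affine function interpolates the fitted values at the design points.
fitted = [max_affine(p, points, values, slopes) for p in points]
```

Feasibility is what makes the interpolation work: every affine piece lies below the fitted value at each design point, and piece $i$ attains that value at $x_i$.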

There are various extensions to the basic convex regression estimate. For instance, in the
one-dimensional setting ($d = 1$), it might be known a priori that $f^*$ is a non-decreasing
function, so that its derivative (or, more generally, each of its subgradients) is non-negative. In this
case, it is natural to impose additional non-negativity constraints ($\tilde{z}_i \geq 0$) on the subgradients
in the estimator (13.13). Figure 13.1(b) compares the standard convex regression estimate
with the estimator that imposes these additional monotonicity constraints. ♣


13.2 Bounding the prediction error 


From a statistical perspective, an essential question associated with the nonparametric
least-squares estimate (13.7) is how well it approximates the true regression function $f^*$. In this
section, we develop some techniques to bound the error $\|\hat{f} - f^*\|_n$, as measured in the $L^2(\mathbb{P}_n)$-norm.
In Chapter 14, we develop results that allow such bounds to be translated into bounds
in the $L^2(\mathbb{P})$-norm.

Figure 13.1 (a) Illustration of the convex regression estimate (13.14) based on a
fixed design with $n = 11$ equidistant samples over the interval $C = [-1, 1]$. (b) Ordinary
convex regression compared with convex and monotonic regression estimate.

Intuitively, the difficulty of estimating the function f* should depend on the complexity 
of the function class F in which it lies. As discussed in Chapter 5, there are a variety of 
ways of measuring the complexity of a function class, notably by its metric entropy or its 
Gaussian complexity. We make use of both of these complexity measures in the results to 
follow. 

Our first main result is defined in terms of a localized form of Gaussian complexity: it
measures the complexity of the function class $\mathcal{F}$, locally in a neighborhood around the true
regression function $f^*$. More precisely, we define the set

$$\mathcal{F}^* := \{ f - f^* \mid f \in \mathcal{F} \}, \tag{13.15}$$

corresponding to an $f^*$-shifted version of the original function class $\mathcal{F}$. For a given radius
$\delta > 0$, the local Gaussian complexity around $f^*$ at scale $\delta$ is given by

$$\mathcal{G}_n(\delta; \mathcal{F}^*) := \mathbb{E}_w \Big[ \sup_{\substack{g \in \mathcal{F}^* \\ \|g\|_n \leq \delta}} \Big| \frac{1}{n} \sum_{i=1}^n w_i g(x_i) \Big| \Big], \tag{13.16}$$

where the variables $\{w_i\}_{i=1}^n$ are i.i.d. $N(0, 1)$ variates. Throughout this chapter, this
complexity measure should be understood as a deterministic quantity, since we are considering the
case of fixed covariates $\{x_i\}_{i=1}^n$.

A central object in our analysis is the set of positive scalars $\delta$ that satisfy the critical
inequality

$$\frac{\mathcal{G}_n(\delta; \mathcal{F}^*)}{\delta} \leq \frac{\delta}{2\sigma}. \tag{13.17}$$

As we verify in Lemma 13.6, whenever the shifted function class $\mathcal{F}^*$ is star-shaped,² the
left-hand side is a non-increasing function of $\delta$, which ensures that the inequality can be
satisfied. We refer to any $\delta_n > 0$ satisfying inequality (13.17) as being valid, and we use
$\delta_n^* > 0$ to denote the smallest positive radius for which inequality (13.17) holds. See the
discussion following Theorem 13.5 for more details on the star-shaped property and the
existence of valid radii $\delta_n$.

Figure 13.2 illustrates the non-increasing property of the function $\delta \mapsto \mathcal{G}_n(\delta)/\delta$ for two
different function classes: a first-order Sobolev space in Figure 13.2(a), and a Gaussian
kernel space in Figure 13.2(b). Both of these function classes are convex, so that the star-shaped
property holds for any $f^*$. Setting $\sigma = 1/2$ for concreteness, the critical radius $\delta_n^*$ can be
determined by finding where this non-increasing function crosses the line with slope one, as
illustrated. As will be clarified later, the Gaussian kernel class is much smaller than the
first-order Sobolev space, so that its critical radius is correspondingly smaller. This ordering
reflects the natural intuition that it should be easier to perform regression over a smaller
function class.
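Finding the crossing point described above is a one-dimensional root search. The sketch below is a minimal illustration, assuming that $\delta \mapsto \mathcal{G}_n(\delta)/\delta$ is non-increasing so that bisection applies. The function name `critical_radius` and the two closed-form stand-ins for $\mathcal{G}_n$ are hypothetical choices made for the example, not quantities computed in the text.

```python
import math


def critical_radius(gauss_complexity, sigma, lo=1e-12, hi=1e6, iters=200):
    """Smallest delta with G(delta)/delta <= delta/(2*sigma), found by bisection.

    Assumes delta -> G(delta)/delta is non-increasing (the star-shaped case), so
    that the valid radii form an interval [delta_star, infinity).
    """
    def valid(delta):
        return gauss_complexity(delta) / delta <= delta / (2.0 * sigma)

    if not valid(hi):
        raise ValueError("upper bracket is not a valid radius")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since radii span many scales
        if valid(mid):
            hi = mid
        else:
            lo = mid
    return hi


n, r, sigma = 100, 5, 1.0
# Hypothetical closed-form stand-ins for the local Gaussian complexity:
linear_like = lambda d: d * math.sqrt(r / n)   # G(delta) proportional to delta
root_like = lambda d: math.sqrt(d / n)         # G(delta) proportional to sqrt(delta)

delta_lin = critical_radius(linear_like, sigma)   # analytically 2 * sigma * sqrt(r / n)
delta_root = critical_radius(root_like, sigma)    # analytically (2 * sigma / sqrt(n)) ** (2 / 3)
```

In both stand-ins the crossing point is available in closed form, which gives a direct check on the search.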


[Figure 13.2 plots: panel (a) "Critical δ for Sobolev kernel", panel (b) "Critical δ for Gaussian kernel"; horizontal axis: radius δ; vertical axis: function value.]

Figure 13.2 Illustration of the critical radius for sample size $n = 100$ and two
different function classes. (a) A first-order Sobolev space. (b) A Gaussian kernel class.
In both cases, the function $\delta \mapsto \mathcal{G}_n(\delta; \mathcal{F})/\delta$, plotted as a solid line, is non-increasing, as
guaranteed by Lemma 13.6. The critical radius $\delta_n^*$, marked by a gray dot, is
determined by finding its intersection with the line of slope $1/(2\sigma)$ with $\sigma = 1$, plotted as
the dashed line. The set of all valid $\delta_n$ consists of the interval $[\delta_n^*, \infty)$.


Some intuition: Why should the inequality (13.17) be relevant to the analysis of the
nonparametric least-squares estimator? A little calculation is helpful in gaining intuition.
Since $\hat{f}$ and $f^*$ are optimal and feasible, respectively, for the constrained least-squares
problem (13.7), we are guaranteed that

$$\frac{1}{2n} \sum_{i=1}^n \big(y_i - \hat{f}(x_i)\big)^2 \leq \frac{1}{2n} \sum_{i=1}^n \big(y_i - f^*(x_i)\big)^2.$$

Recalling that $y_i = f^*(x_i) + \sigma w_i$, some simple algebra leads to the equivalent expression

$$\frac{1}{2} \|\hat{f} - f^*\|_n^2 \leq \frac{\sigma}{n} \sum_{i=1}^n w_i \big(\hat{f}(x_i) - f^*(x_i)\big), \tag{13.18}$$

which we call the basic inequality for nonparametric least squares.

² A function class $\mathcal{H}$ is star-shaped if for any $h \in \mathcal{H}$ and $\alpha \in [0, 1]$, the rescaled function $\alpha h$ also belongs
to $\mathcal{H}$.

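Now, by definition, the difference function $\hat{f} - f^*$ belongs to $\mathcal{F}^*$, so that we can bound
the right-hand side by taking the supremum over all functions $g \in \mathcal{F}^*$ with $\|g\|_n \leq \|\hat{f} - f^*\|_n$.
Reasoning heuristically, this observation suggests that the squared error $\delta^2 := \mathbb{E}\big[\|\hat{f} - f^*\|_n^2\big]$
should satisfy a bound of the form

$$\frac{\delta^2}{2} \leq \sigma \, \mathcal{G}_n(\delta; \mathcal{F}^*) \quad \text{or equivalently} \quad \frac{\delta}{2\sigma} \leq \frac{\mathcal{G}_n(\delta; \mathcal{F}^*)}{\delta}. \tag{13.19}$$

By definition (13.17) of the critical radius $\delta_n^*$, this inequality can only hold for values of $\delta \leq \delta_n^*$.
In summary, this heuristic argument suggests a bound of the form $\mathbb{E}\big[\|\hat{f} - f^*\|_n^2\big] \leq (\delta_n^*)^2$.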

To be clear, the step from the basic inequality (13.18) to the bound (13.19) is not
rigorously justified for various reasons, but the underlying intuition is correct. Let us now state a
rigorous result, one that applies to the least-squares estimator (13.7) based on observations
from the standard Gaussian noise model $y_i = f^*(x_i) + \sigma w_i$.


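Theorem 13.5 Suppose that the shifted function class $\mathcal{F}^*$ is star-shaped, and let $\delta_n$
be any positive solution to the critical inequality (13.17). Then for any $t \geq \delta_n$, the
nonparametric least-squares estimate $\hat{f}_n$ satisfies the bound

$$\mathbb{P}\big[ \|\hat{f}_n - f^*\|_n^2 \geq 16 \, t \, \delta_n \big] \leq e^{-\frac{n t \delta_n}{2\sigma^2}}. \tag{13.20}$$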


Remarks: The bound (13.20) provides non-asymptotic control on the regression error
$\|\hat{f} - f^*\|_n^2$. By integrating this tail bound, it follows that the mean-squared error in the
$L^2(\mathbb{P}_n)$-semi-norm is upper bounded as

$$\mathbb{E}\big[ \|\hat{f} - f^*\|_n^2 \big] \leq c \Big\{ \delta_n^2 + \frac{\sigma^2}{n} \Big\} \quad \text{for some universal constant } c.$$

As shown in Exercise 13.5, for any function class $\mathcal{F}$ that contains the constant function
$f \equiv 1$, we necessarily have $\delta_n^2 \geq \frac{2}{\pi} \frac{\sigma^2}{n}$, so that (disregarding constants) the $\delta_n^2$ term is always
the dominant one.

For concreteness, we have stated the result for the case of additive Gaussian noise
($v_i = \sigma w_i$). However, as the proof will clarify, all that is required is an upper tail bound on the
random variable

$$Z_n(\delta) := \sup_{\substack{g \in \mathcal{F}^* \\ \|g\|_n \leq \delta}} \Big| \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\sigma} g(x_i) \Big|$$

in terms of its expectation. The expectation $\mathbb{E}[Z_n(\delta)]$ defines a more general form of
(potentially non-Gaussian) noise complexity that then determines the critical radius.

The star-shaped condition on the shifted function class $\mathcal{F}^* = \mathcal{F} - f^*$ is needed at various
parts of the proof, including in ensuring the existence of valid radii $\delta_n$ (see Lemma 13.6
to follow). In explicit terms, the function class $\mathcal{F}^*$ is star-shaped if for any $g \in \mathcal{F}^*$ and
$\alpha \in [0, 1]$, the function $\alpha g$ also belongs to $\mathcal{F}^*$. Equivalently, we say that $\mathcal{F}$ is star-shaped
around $f^*$. For instance, if $\mathcal{F}$ is convex, then as illustrated in Figure 13.3 it is necessarily
star-shaped around any $f^* \in \mathcal{F}$. Conversely, if $\mathcal{F}$ is not convex, then there must exist
choices $f^* \in \mathcal{F}$ such that $\mathcal{F}^*$ is not star-shaped. However, for a general non-convex set
$\mathcal{F}$, it is still possible that $\mathcal{F}^*$ is star-shaped for some choices of $f^*$. See Figure 13.3 for an
illustration of these possibilities, and Exercise 13.4 for further details.


Figure 13.3 Illustration of star-shaped properties of sets. (a) The set $\mathcal{F}$ is convex,
and hence is star-shaped around any of its points. The line between $f^*$ and $f$ is
contained within $\mathcal{F}$, and the same is true for any line joining any pair of points in
$\mathcal{F}$. (b) A set $\mathcal{F}$ that is not star-shaped around all its points. It fails to be star-shaped
around the point $f^*$, since the line drawn to $f \in \mathcal{F}$ does not lie within the set.
However, this set is star-shaped around the point $f^\dagger$.


If the star-shaped condition fails to hold, then Theorem 13.5 can instead be applied with
$\delta_n$ defined in terms of the star hull

$$\operatorname{star}(\mathcal{F}^*; 0) := \{ \alpha g \mid g \in \mathcal{F}^*, \ \alpha \in [0, 1] \} = \{ \alpha (f - f^*) \mid f \in \mathcal{F}, \ \alpha \in [0, 1] \}. \tag{13.21}$$

Moreover, since the function $f^*$ is not known to us, we often replace $\mathcal{F}^*$ with the larger
class

$$\partial \mathcal{F} := \mathcal{F} - \mathcal{F} = \{ f_1 - f_2 \mid f_1, f_2 \in \mathcal{F} \}, \tag{13.22}$$

or its star hull when necessary. We illustrate these considerations in the concrete examples
to follow.


Let us now verify that the star-shaped condition ensures existence of the critical radius:


Lemma 13.6 For any star-shaped function class $\mathcal{H}$, the function $\delta \mapsto \frac{\mathcal{G}_n(\delta; \mathcal{H})}{\delta}$ is
non-increasing on the interval $(0, \infty)$. Consequently, for any constant $c > 0$, the inequality

$$\frac{\mathcal{G}_n(\delta; \mathcal{H})}{\delta} \leq c \, \delta \tag{13.23}$$

has a smallest positive solution.


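Proof So as to ease notation, we drop the dependence of $\mathcal{G}_n$ on the function class $\mathcal{H}$
throughout this proof. Given a pair $0 < \delta \leq t$, it suffices to show that $\frac{\delta}{t} \mathcal{G}_n(t) \leq \mathcal{G}_n(\delta)$. Given
any function $h \in \mathcal{H}$ with $\|h\|_n \leq t$, we may define the rescaled function $\tilde{h} = \frac{\delta}{t} h$, and write

$$\frac{1}{n} \, \frac{\delta}{t} \, \Big| \sum_{i=1}^n w_i h(x_i) \Big| = \frac{1}{n} \Big| \sum_{i=1}^n w_i \tilde{h}(x_i) \Big|.$$

By construction, we have $\|\tilde{h}\|_n \leq \delta$; moreover, since $\delta \leq t$, the star-shaped assumption
guarantees that $\tilde{h} \in \mathcal{H}$. Consequently, for any $\tilde{h}$ formed in this way, the right-hand side is
at most $\mathcal{G}_n(\delta)$ in expectation. Taking the supremum over the set $\mathcal{H} \cap \{\|h\|_n \leq t\}$ followed by
expectations yields $\frac{\delta}{t} \mathcal{G}_n(t)$ on the left-hand side. Combining the pieces yields the claim.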
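The rescaling argument in this proof can be observed numerically. The sketch below is an illustration under stated assumptions: the function class is the star hull of a small, randomly generated set of "atoms" (a hypothetical choice), each function is represented by its values at the $n$ design points, and the expectation is replaced by an average over a fixed set of noise draws.

```python
import math
import random


def emp_norm(h):
    """Empirical norm ||h||_n of a function given by its values at the design points."""
    return math.sqrt(sum(v * v for v in h) / len(h))


def local_sup(w, atoms, delta):
    """Supremum of |<w, h>| / n over the star hull of `atoms` with ||h||_n <= delta.

    A star-hull element is alpha * atom with alpha in [0, 1]; for each atom the
    best admissible scaling is alpha = min(1, delta / ||atom||_n).
    """
    n = len(w)
    best = 0.0
    for h in atoms:
        alpha = min(1.0, delta / emp_norm(h))
        best = max(best, alpha * abs(sum(wi * hi for wi, hi in zip(w, h))) / n)
    return best


def local_complexity(atoms, delta, draws):
    """Average of local_sup over the supplied noise draws (Monte Carlo stand-in for G_n)."""
    return sum(local_sup(w, atoms, delta) for w in draws) / len(draws)


random.seed(1)
n = 20
atoms = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(8)]   # hypothetical class
draws = [[random.gauss(0, 1) for _ in range(n)] for _ in range(200)]
radii = [0.05 * k for k in range(1, 21)]
ratios = [local_complexity(atoms, d, draws) / d for d in radii]         # G_n(delta) / delta
```

Because the same noise draws are reused at every radius, the monotonicity holds draw by draw here, not merely on average.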


In practice, determining the exact value of the critical radius $\delta_n$ may be difficult, so that
we seek reasonable upper bounds on it. As shown in Exercise 13.5, we always have $\delta_n \leq \sigma$,
but this is a very crude result. By bounding the local Gaussian complexity, we will obtain
much finer results, as illustrated in the examples to follow.


13.2.1 Bounds via metric entropy 


Note that the localized Gaussian complexity corresponds to the expected absolute maximum 
of a Gaussian process. As discussed in Chapter 5, Dudley’s entropy integral can be used to 
upper bound such quantities. 

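In order to do so, let us begin by introducing some convenient notation. For any function
class $\mathcal{H}$, we define $\mathbb{B}_n(\delta; \mathcal{H}) := \{ h \in \operatorname{star}(\mathcal{H}) \mid \|h\|_n \leq \delta \}$, and we let $N_n(t; \mathbb{B}_n(\delta; \mathcal{H}))$
denote the $t$-covering number of $\mathbb{B}_n(\delta; \mathcal{H})$ in the norm $\|\cdot\|_n$. With this notation, we have the
following corollary: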




Corollary 13.7 Under the conditions of Theorem 13.5, any $\delta \in (0, \sigma]$ such that

$$\frac{16}{\sqrt{n}} \int_{\frac{\delta^2}{4\sigma}}^{\delta} \sqrt{\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F}^*)\big)} \, dt \leq \frac{\delta^2}{4\sigma} \tag{13.24}$$

satisfies the critical inequality (13.17), and hence can be used in the conclusion of
Theorem 13.5.


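Proof For any $\delta \in (0, \sigma]$, we have $\frac{\delta^2}{4\sigma} < \delta$, so that we can construct a minimal $\frac{\delta^2}{4\sigma}$-covering
of the set $\mathbb{B}_n(\delta; \mathcal{F}^*)$ in the $L^2(\mathbb{P}_n)$-norm, say $\{g^1, \ldots, g^M\}$. For any function $g \in \mathbb{B}_n(\delta; \mathcal{F}^*)$,
there is an index $j \in [M]$ such that $\|g^j - g\|_n \leq \frac{\delta^2}{4\sigma}$. Consequently, we have

$$\begin{aligned}
\Big| \frac{1}{n} \sum_{i=1}^n w_i g(x_i) \Big|
&\overset{(i)}{\leq} \Big| \frac{1}{n} \sum_{i=1}^n w_i g^j(x_i) \Big| + \Big| \frac{1}{n} \sum_{i=1}^n w_i \big(g(x_i) - g^j(x_i)\big) \Big| \\
&\overset{(ii)}{\leq} \max_{j = 1, \ldots, M} \Big| \frac{1}{n} \sum_{i=1}^n w_i g^j(x_i) \Big| + \sqrt{\frac{\sum_{i=1}^n w_i^2}{n}} \sqrt{\frac{\sum_{i=1}^n \big(g(x_i) - g^j(x_i)\big)^2}{n}} \\
&\overset{(iii)}{\leq} \max_{j = 1, \ldots, M} \Big| \frac{1}{n} \sum_{i=1}^n w_i g^j(x_i) \Big| + \sqrt{\frac{\sum_{i=1}^n w_i^2}{n}} \, \frac{\delta^2}{4\sigma},
\end{aligned}$$

where step (i) follows from the triangle inequality, step (ii) follows from the
Cauchy-Schwarz inequality and step (iii) uses the covering property. Taking the supremum over
$g \in \mathbb{B}_n(\delta; \mathcal{F}^*)$ on the left-hand side and then expectation over the noise, we obtain

$$\mathcal{G}_n(\delta) \leq \mathbb{E}_w \Big[ \max_{j = 1, \ldots, M} \Big| \frac{1}{n} \sum_{i=1}^n w_i g^j(x_i) \Big| \Big] + \frac{\delta^2}{4\sigma}, \tag{13.25}$$

where we have used the fact that $\mathbb{E}_w \sqrt{\frac{\sum_{i=1}^n w_i^2}{n}} \leq 1$.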

It remains to upper bound the expected maximum over the $M$ functions in the cover, and
we do this by using the chaining method from Chapter 5. Define the family of Gaussian
random variables $Z(g^j) := \frac{1}{\sqrt{n}} \sum_{i=1}^n w_i g^j(x_i)$ for $j = 1, \ldots, M$. Some calculation shows that
they are zero-mean, and their associated semi-metric is given by

$$\rho_Z^2(g^j, g^k) := \operatorname{var}\big(Z(g^j) - Z(g^k)\big) = \|g^j - g^k\|_n^2.$$

Since $\|g\|_n \leq \delta$ for all $g \in \mathbb{B}_n(\delta; \mathcal{F}^*)$, the coarsest resolution of the chaining can be set to
$\delta$, and we can terminate it at $\frac{\delta^2}{4\sigma}$, since any member of our finite set can be reconstructed
exactly at this resolution. Working through the chaining argument, we find that

$$\mathbb{E}_w \Big[ \max_{j = 1, \ldots, M} \Big| \frac{1}{n} \sum_{i=1}^n w_i g^j(x_i) \Big| \Big]
= \mathbb{E}_w \Big[ \max_{j = 1, \ldots, M} \frac{|Z(g^j)|}{\sqrt{n}} \Big]
\leq \frac{16}{\sqrt{n}} \int_{\frac{\delta^2}{4\sigma}}^{\delta} \sqrt{\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F}^*)\big)} \, dt.$$

Combined with our earlier bound (13.25), this establishes the claim.
Some examples are helpful in understanding the uses of Theorem 13.5 and Corollary 13.7,
and we devote the following subsections to such illustrations.


13.2.2 Bounds for high-dimensional parametric problems 
We begin with some bounds for parametric problems, allowing for a general dimension. 


Example 13.8 (Bound for linear regression) As a warm-up, consider the standard linear
regression model $y_i = \langle \theta^*, x_i \rangle + w_i$ where $\theta^* \in \mathbb{R}^d$. Although it is a parametric model,
some insight can be gained by analyzing it using our general theory. The usual least-squares
estimate corresponds to optimizing over the function class

$$\mathcal{F}_{\text{lin}} = \big\{ f_\theta(\cdot) = \langle \theta, \cdot \rangle \mid \theta \in \mathbb{R}^d \big\}.$$

Let $X \in \mathbb{R}^{n \times d}$ denote the design matrix, with $x_i \in \mathbb{R}^d$ as its $i$th row. In this example, we use
our general theory to show that the least-squares estimate satisfies a bound of the form

$$\frac{\|X(\hat{\theta} - \theta^*)\|_2^2}{n} \lesssim \sigma^2 \, \frac{\operatorname{rank}(X)}{n} \tag{13.26}$$

with high probability. To be clear, in this special case, this bound (13.26) can be obtained by
a direct linear algebraic argument, as we explore in Exercise 13.2. However, it is instructive
to see how our general theory leads to concrete predictions in a special case.

We begin by observing that the shifted function class $\mathcal{F}^*_{\text{lin}}$ is equal to $\mathcal{F}_{\text{lin}}$ for any choice
of $f^*$. Moreover, the set $\mathcal{F}_{\text{lin}}$ is convex and hence star-shaped around any point (see
Exercise 13.4), so that Corollary 13.7 can be applied. The mapping $\theta \mapsto \|f_\theta\|_n = \frac{\|X\theta\|_2}{\sqrt{n}}$ defines a
norm on the subspace $\operatorname{range}(X)$, and the set $\mathbb{B}_n(\delta; \mathcal{F}_{\text{lin}})$ is isomorphic to a $\delta$-ball within the
space $\operatorname{range}(X)$. Since this range space has dimension given by $\operatorname{rank}(X)$, by a volume ratio
argument (see Example 5.8), we have

$$\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F}_{\text{lin}})\big) \leq r \log\Big(1 + \frac{2\delta}{t}\Big), \quad \text{where } r := \operatorname{rank}(X).$$

Using this upper bound in Corollary 13.7, we find that

$$\begin{aligned}
\frac{1}{\sqrt{n}} \int_0^\delta \sqrt{\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F}_{\text{lin}})\big)} \, dt
&\leq \sqrt{\frac{r}{n}} \int_0^\delta \sqrt{\log\Big(1 + \frac{2\delta}{t}\Big)} \, dt \\
&\overset{(i)}{=} \delta \sqrt{\frac{r}{n}} \int_0^1 \sqrt{\log\Big(1 + \frac{2}{u}\Big)} \, du \\
&\overset{(ii)}{=} c \, \delta \sqrt{\frac{r}{n}},
\end{aligned}$$

where we have made the change of variables $u = t/\delta$ in step (i), and the final step (ii) follows
since the integral is a constant. Putting together the pieces, an application of Corollary 13.7
yields the claim (13.26). In fact, the bound (13.26) is minimax-optimal up to constant factors,
as we will show in Chapter 15. ♣
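The change of variables $u = t/\delta$ used above can be checked numerically: after dividing the entropy integral by $\delta \sqrt{r/n}$, what remains should not depend on $\delta$. The sketch below uses a plain midpoint rule; the function name `entropy_integral` and the values of `n`, `r` and `delta` are arbitrary choices for illustration.

```python
import math


def entropy_integral(delta, r, n, steps=100_000):
    """Midpoint-rule value of (1/sqrt(n)) * int_0^delta sqrt(r * log(1 + 2*delta/t)) dt."""
    h = delta / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.sqrt(r * math.log(1.0 + 2.0 * delta / t))
    return total * h / math.sqrt(n)


n, r = 200, 10
# After substituting u = t / delta, the integral equals delta * sqrt(r / n) * c with
# c = int_0^1 sqrt(log(1 + 2/u)) du, which depends on neither delta nor (r, n).
c_small = entropy_integral(0.1, r, n) / (0.1 * math.sqrt(r / n))
c_large = entropy_integral(3.0, r, n) / (3.0 * math.sqrt(r / n))
```

The integrand has a mild singularity at $t = 0$, but it is integrable, which is why the midpoint rule (which never evaluates at zero) suffices here.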


Let us now consider another high-dimensional parametric problem, namely that of sparse 
linear regression. 
Example 13.9 (Bounds for linear regression over $\ell_q$-“balls”) Consider the case of sparse
linear regression, where the $d$-variate regression vector $\theta$ is assumed to lie within the $\ell_q$-ball
of radius $R_q$, namely the set

$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^d \ \Big| \ \sum_{j=1}^d |\theta_j|^q \leq R_q \Big\}. \tag{13.27}$$

See Figure 7.1 for an illustration of these sets for different choices of $q \in (0, 1]$. Consider the
class of linear functions $f_\theta(x) = \langle \theta, x \rangle$ given by

$$\mathcal{F}_q(R_q) := \{ f_\theta \mid \theta \in \mathbb{B}_q(R_q) \}. \tag{13.28}$$

We adopt the shorthand $\mathcal{F}_q$ when the radius $R_q$ is clear from context.
In this example, we focus on the range $q \in (0, 1)$. Suppose that we solve the least-squares
problem with $\ell_q$ regularization; that is, we compute the estimate

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{B}_q(R_q)} \Big\{ \frac{1}{n} \sum_{i=1}^n \big(y_i - \langle x_i, \theta \rangle\big)^2 \Big\}. \tag{13.29}$$

Unlike the $\ell_1$-constrained Lasso analyzed in Chapter 7, note that this is not a convex
program. Indeed, for $q \in (0, 1)$, the function class $\mathcal{F}_q(R_q)$ is not convex, so that there exists $\theta^* \in
\mathbb{B}_q(R_q)$ such that the shifted class $\mathcal{F}_q^* = \mathcal{F}_q - f_{\theta^*}$ is not star-shaped. Accordingly, we instead
focus on bounding the metric entropy of the function class $\mathcal{F}_q(R_q) - \mathcal{F}_q(R_q) = 2\mathcal{F}_q(R_q)$.
Note that for all $q \in (0, 1)$ and numbers $a, b \in \mathbb{R}$, we have $|a + b|^q \leq |a|^q + |b|^q$, which implies
that $2\mathcal{F}_q(R_q)$ is contained within $\mathcal{F}_q(2R_q)$.
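The inclusion just stated rests on the scalar inequality $|a + b|^q \leq |a|^q + |b|^q$ applied coordinate-wise. The following sketch checks it on random draws: differences of two points on the boundary of the $\ell_q$-ball of radius $R_q$ never have $\ell_q$ mass above $2R_q$. The helper names and the values `d`, `q`, `radius` are illustrative assumptions.

```python
import random


def lq_mass(theta, q):
    """The quantity sum_j |theta_j|^q that defines the l_q-'ball' in (13.27)."""
    return sum(abs(t) ** q for t in theta)


def random_point_in_ball(d, q, radius, rng):
    """Draw a Gaussian vector and rescale it so that its l_q mass equals `radius`."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = (radius / lq_mass(v, q)) ** (1.0 / q)
    return [scale * t for t in v]


rng = random.Random(0)
d, q, radius = 10, 0.5, 2.0
worst = 0.0
for _ in range(1000):
    a = random_point_in_ball(d, q, radius, rng)
    b = random_point_in_ball(d, q, radius, rng)
    diff = [s - t for s, t in zip(a, b)]
    worst = max(worst, lq_mass(diff, q))   # should never exceed 2 * radius
```

A random search of this kind cannot prove the inclusion, but it is a quick sanity check on the direction of the inequality.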

It is known that for $q \in (0, 1)$, and under mild conditions on the choice of $t$ relative to the
triple $(n, d, R_q)$, the metric entropy of the $\ell_q$-ball with respect to $\ell_2$-norm is upper bounded
by

$$\log N_{2,q}(t) \leq C_q \Big[ R_q^{\frac{2}{2-q}} \Big(\frac{1}{t}\Big)^{\frac{2q}{2-q}} \log d \Big], \tag{13.30}$$

where $C_q$ is a constant depending only on $q$.

Given our design vectors $\{x_i\}_{i=1}^n$, consider the $n \times d$ design matrix $X$ with $x_i^T$ as its $i$th row,
and let $X_j \in \mathbb{R}^n$ denote its $j$th column. Our objective is to bound the metric entropy of the
set of all vectors of the form

$$\frac{X\theta}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{j=1}^d X_j \theta_j \tag{13.31}$$

as $\theta$ ranges over $\mathbb{B}_q(R_q)$, an object known as the $q$-convex hull of the renormalized
column vectors $\{X_1, \ldots, X_d\}/\sqrt{n}$. Letting $C$ denote a numerical constant such that
$\max_{j = 1, \ldots, d} \|X_j\|_2/\sqrt{n} \leq C$, it is known that the metric entropy of this $q$-convex hull has the
same scaling as the original $\ell_q$-ball. See the bibliographic section for further discussion of
these facts about metric entropy.
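Exploiting this fact and our earlier bound (13.30) on the metric entropy of the $\ell_q$-ball, we
find that

$$\frac{1}{\sqrt{n}} \int_{\frac{\delta^2}{4\sigma}}^{\delta} \sqrt{\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F}_q(2R_q))\big)} \, dt
\lesssim R_q^{\frac{1}{2-q}} \sqrt{\frac{\log d}{n}} \int_0^\delta \Big(\frac{1}{t}\Big)^{\frac{q}{2-q}} dt
\lesssim R_q^{\frac{1}{2-q}} \sqrt{\frac{\log d}{n}} \ \delta^{1 - \frac{q}{2-q}},$$

a calculation valid for all $q \in (0, 1)$. Corollary 13.7 now implies that the critical
condition (13.17) is satisfied as long as

$$R_q^{\frac{1}{2-q}} \sqrt{\frac{\sigma^2 \log d}{n}} \lesssim \delta^{1 + \frac{q}{2-q}}, \quad \text{or equivalently} \quad R_q \Big( \frac{\sigma^2 \log d}{n} \Big)^{1 - \frac{q}{2}} \lesssim \delta^2.$$

Theorem 13.5 then implies that

$$\frac{\|X(\hat{\theta} - \theta^*)\|_2^2}{n} = \|f_{\hat{\theta}} - f_{\theta^*}\|_n^2 \lesssim R_q \Big( \frac{\sigma^2 \log d}{n} \Big)^{1 - \frac{q}{2}}$$

with high probability. Although this result is a corollary of our general theorem, this rate is
minimax-optimal up to constant factors, meaning that no estimator can achieve a faster rate.
See the bibliographic section for further discussion and references of these connections. ♣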


13.2.3 Bounds for nonparametric problems 


Let us now illustrate the use of our techniques for some nonparametric problems. 
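Example 13.10 (Bounds for Lipschitz functions) Consider the class of functions

$$\mathcal{F}_{\text{Lip}}(L) := \{ f: [0, 1] \to \mathbb{R} \mid f(0) = 0, \ f \text{ is } L\text{-Lipschitz} \}. \tag{13.32}$$

Recall that $f$ being $L$-Lipschitz means that $|f(x) - f(x')| \leq L|x - x'|$ for all $x, x' \in [0, 1]$. Let us
analyze the prediction error associated with nonparametric least squares over this function
class.

Noting the inclusion

$$\mathcal{F}_{\text{Lip}}(L) - \mathcal{F}_{\text{Lip}}(L) = 2\mathcal{F}_{\text{Lip}}(L) \subseteq \mathcal{F}_{\text{Lip}}(2L),$$

it suffices to upper bound the metric entropy of $\mathcal{F}_{\text{Lip}}(2L)$. Based on our discussion
from Example 5.10, the metric entropy of this class in the supremum norm scales as
$\log N_\infty(\epsilon; \mathcal{F}_{\text{Lip}}(2L)) \asymp (L/\epsilon)$. Consequently, we have

$$\frac{1}{\sqrt{n}} \int_0^\delta \sqrt{\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F}_{\text{Lip}}(2L))\big)} \, dt
\lesssim \frac{1}{\sqrt{n}} \int_0^\delta \sqrt{\log N_\infty\big(t; \mathcal{F}_{\text{Lip}}(2L)\big)} \, dt
\lesssim \frac{1}{\sqrt{n}} \int_0^\delta \Big(\frac{L}{t}\Big)^{1/2} dt
\lesssim \frac{\sqrt{L\delta}}{\sqrt{n}},$$

where $\lesssim$ denotes an inequality holding apart from constants not dependent on the triplet
$(\delta, L, n)$. Thus, it suffices to choose $\delta_n > 0$ such that $\frac{\sqrt{L\delta_n}}{\sqrt{n}} \lesssim \frac{\delta_n^2}{\sigma}$, or equivalently $\delta_n^2 \asymp \big(\frac{L\sigma^2}{n}\big)^{2/3}$.
Putting together the pieces, Corollary 13.7 implies that the error in the nonparametric
least-squares estimate satisfies the bound

$$\|\hat{f} - f^*\|_n^2 \lesssim \Big( \frac{L\sigma^2}{n} \Big)^{2/3} \tag{13.33}$$

with probability at least $1 - c_1 e^{-c_2 n \delta_n^2/\sigma^2}$, where the exponent $n\delta_n^2/\sigma^2$ scales as $(nL^2/\sigma^2)^{1/3}$. ♣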


Example 13.11 (Bounds for convex regression) As a continuation of the previous example,
let us consider the class of convex 1-Lipschitz functions, namely

$$\mathcal{F}_{\text{conv}}([0, 1]; 1) := \{ f: [0, 1] \to \mathbb{R} \mid f(0) = 0 \text{ and } f \text{ is convex and 1-Lipschitz} \}.$$

As discussed in Example 13.4, computation of the nonparametric least-squares estimate over
such convex classes can be reduced to a type of quadratic program. Here we consider the
statistical rates that are achievable by such an estimator.

It is known that the metric entropy of $\mathcal{F}_{\text{conv}}$, when measured in the infinity norm, satisfies
the upper bound

$$\log N(\epsilon; \mathcal{F}_{\text{conv}}, \|\cdot\|_\infty) \lesssim \Big(\frac{1}{\epsilon}\Big)^{1/2} \tag{13.34}$$

for all $\epsilon > 0$ sufficiently small. (See the bibliographic section for details.) Thus, we can
again use an entropy integral approach to derive upper bounds on the prediction error. In
particular, calculations similar to those in the previous example show that the conditions of
Corollary 13.7 hold for $\delta_n^2 \asymp \big(\frac{\sigma^2}{n}\big)^{4/5}$, and so we are guaranteed that

$$\|\hat{f} - f^*\|_n^2 \lesssim \Big( \frac{\sigma^2}{n} \Big)^{4/5} \tag{13.35}$$

with probability at least $1 - c_1 e^{-c_2 (n/\sigma^2)^{1/5}}$.

Note that our error bound (13.35) for convex Lipschitz functions is substantially faster
than our earlier bound (13.33) for Lipschitz functions without a convexity constraint: in
particular, the respective rates are $n^{-4/5}$ versus $n^{-2/3}$. In Chapter 15, we show that both of
these rates are minimax-optimal, meaning that, apart from constant factors, they cannot be
improved substantially. Thus, we see that the additional constraint of convexity is
significant from a statistical point of view. In fact, as we explore in Exercise 13.8, in terms of
their estimation error, convex Lipschitz functions behave exactly like the class of all
twice-differentiable functions with bounded second derivative, so that the convexity constraint
amounts to imposing an extra degree of smoothness. ♣
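Both rates come from the same balancing calculation. If the metric entropy scales as $(1/t)^p$ with $0 < p < 2$, then the entropy integral scales as $\delta^{1 - p/2}/\sqrt{n}$, and matching it against $\delta^2/\sigma$ gives $\delta_n^2 \asymp (\sigma^2/n)^{2/(2+p)}$. The sketch below encodes this bookkeeping with exact fractions; the function names are hypothetical and all constants are ignored.

```python
from fractions import Fraction


def rate_exponent(p):
    """Exponent a in delta_n^2 ~ (sigma^2 / n)^a when log N(t) ~ (1/t)^p, with 0 < p < 2.

    Balancing (1/sqrt(n)) * int_0^delta t^(-p/2) dt ~ delta^(1 - p/2) / sqrt(n)
    against delta^2 / sigma gives delta^(1 + p/2) ~ sigma / sqrt(n).
    """
    p = Fraction(p)
    if not 0 < p < 2:
        raise ValueError("the entropy integral converges at 0 only for p < 2")
    return 2 / (2 + p)


def critical_radius_sq(p, sigma, n):
    """delta_n^2 up to constants, i.e. (sigma^2 / n) ** rate_exponent(p)."""
    return (sigma ** 2 / n) ** float(rate_exponent(p))


lipschitz = rate_exponent(1)               # Fraction(2, 3), matching (13.33)
convex = rate_exponent(Fraction(1, 2))     # Fraction(4, 5), matching (13.35)
```

Smaller entropy exponents $p$ give exponents closer to one, that is, rates closer to the parametric rate $\sigma^2/n$.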


13.2.4 Proof of Theorem 13.5 


We now turn to the proof of our previously stated theorem. 
Establishing a basic inequality


Recall the basic inequality (13.18) established in our earlier discussion. In terms of the
shorthand notation $\hat{\Delta} = \hat{f} - f^*$, it can be written as

$$\frac{1}{2} \|\hat{\Delta}\|_n^2 \leq \frac{\sigma}{n} \sum_{i=1}^n w_i \hat{\Delta}(x_i). \tag{13.36}$$

By definition, the error function $\hat{\Delta} = \hat{f} - f^*$ belongs to the shifted function class $\mathcal{F}^*$.


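Controlling the right-hand side


In order to control the stochastic component on the right-hand side, we begin by stating
an auxiliary lemma in a somewhat more general form, since it is useful for subsequent
arguments. Let $\mathcal{H}$ be an arbitrary star-shaped function class, and let $\delta_n > 0$ satisfy the
inequality $\frac{\mathcal{G}_n(\delta; \mathcal{H})}{\delta} \leq \frac{\delta}{2\sigma}$. For a given scalar $u \geq \delta_n$, define the event

$$\mathcal{A}(u) := \Big\{ \exists \, g \in \mathcal{H} \cap \{\|g\|_n \geq u\} \ \Big| \ \Big| \frac{\sigma}{n} \sum_{i=1}^n w_i g(x_i) \Big| \geq 2 \|g\|_n u \Big\}. \tag{13.37}$$

The following lemma provides control on the probability of this event:


Lemma 13.12 For all $u \geq \delta_n$, we have

$$\mathbb{P}[\mathcal{A}(u)] \leq e^{-\frac{n u^2}{2\sigma^2}}. \tag{13.38}$$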


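Let us prove the main result by exploiting this lemma, in particular with the settings
$\mathcal{H} = \mathcal{F}^*$ and $u = \sqrt{t\delta_n}$ for some $t \geq \delta_n$, so that we have

$$\mathbb{P}\big[\mathcal{A}^c(\sqrt{t\delta_n})\big] \geq 1 - e^{-\frac{n t \delta_n}{2\sigma^2}}.$$

If $\|\hat{\Delta}\|_n < \sqrt{t\delta_n}$, then the claim is immediate. Otherwise, we have $\hat{\Delta} \in \mathcal{F}^*$ and $\|\hat{\Delta}\|_n \geq \sqrt{t\delta_n}$,
so that we may condition on $\mathcal{A}^c(\sqrt{t\delta_n})$ so as to obtain the bound

$$\Big| \frac{\sigma}{n} \sum_{i=1}^n w_i \hat{\Delta}(x_i) \Big| \leq 2 \|\hat{\Delta}\|_n \sqrt{t\delta_n}.$$

Consequently, the basic inequality (13.36) implies that $\|\hat{\Delta}\|_n^2 \leq 4 \|\hat{\Delta}\|_n \sqrt{t\delta_n}$, or equivalently
that $\|\hat{\Delta}\|_n^2 \leq 16 \, t \, \delta_n$, a bound that holds with probability at least $1 - e^{-\frac{n t \delta_n}{2\sigma^2}}$.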


In order to complete the proof of Theorem 13.5, it remains to prove Lemma 13.12. 


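Proof of Lemma 13.12


Our first step is to reduce the problem to controlling a supremum over a subset of functions
satisfying the upper bound $\|\tilde{g}\|_n \leq u$. Suppose that there exists some $g \in \mathcal{H}$ with $\|g\|_n \geq u$
such that

$$\Big| \frac{\sigma}{n} \sum_{i=1}^n w_i g(x_i) \Big| \geq 2 \|g\|_n u. \tag{13.39}$$

Defining the function $\tilde{g} := \frac{u}{\|g\|_n} g$, we observe that $\|\tilde{g}\|_n = u$. Since $g \in \mathcal{H}$ and $\frac{u}{\|g\|_n} \in (0, 1]$,
the star-shaped assumption implies that $\tilde{g} \in \mathcal{H}$. Consequently, we have shown that if there
exists a function $g$ satisfying the inequality (13.39), which occurs whenever the event $\mathcal{A}(u)$
is true, then there exists a function $\tilde{g} \in \mathcal{H}$ with $\|\tilde{g}\|_n = u$ such that

$$\Big| \frac{\sigma}{n} \sum_{i=1}^n w_i \tilde{g}(x_i) \Big| = \frac{u}{\|g\|_n} \Big| \frac{\sigma}{n} \sum_{i=1}^n w_i g(x_i) \Big| \geq 2u^2.$$

We thus conclude that

$$\mathbb{P}[\mathcal{A}(u)] \leq \mathbb{P}[Z_n(u) \geq 2u^2], \quad \text{where } Z_n(u) := \sup_{\substack{\tilde{g} \in \mathcal{H} \\ \|\tilde{g}\|_n \leq u}} \Big| \frac{\sigma}{n} \sum_{i=1}^n w_i \tilde{g}(x_i) \Big|. \tag{13.40}$$

Since the noise variables $w_i \sim N(0, 1)$ are i.i.d., the variable $\frac{\sigma}{n} \sum_{i=1}^n w_i \tilde{g}(x_i)$ is zero-mean
and Gaussian for each fixed $\tilde{g}$. Therefore, the variable $Z_n(u)$ corresponds to the supremum of
a Gaussian process. If we view this supremum as a function of the standard Gaussian vector
$(w_1, \ldots, w_n)$, then it can be verified that the associated Lipschitz constant is at most $\frac{\sigma u}{\sqrt{n}}$.
Consequently, Theorem 2.26 guarantees the tail bound $\mathbb{P}\big[Z_n(u) \geq \mathbb{E}[Z_n(u)] + s\big] \leq e^{-\frac{n s^2}{2u^2\sigma^2}}$,
valid for any $s > 0$. Setting $s = u^2$ yields

$$\mathbb{P}\big[Z_n(u) \geq \mathbb{E}[Z_n(u)] + u^2\big] \leq e^{-\frac{n u^2}{2\sigma^2}}. \tag{13.41}$$

Finally, by definition of $Z_n(u)$ and $\mathcal{G}_n(u)$, we have $\mathbb{E}[Z_n(u)] = \sigma \mathcal{G}_n(u)$. By Lemma 13.6, the
function $v \mapsto \frac{\mathcal{G}_n(v)}{v}$ is non-increasing, and since $u \geq \delta_n$ by assumption, we have

$$\sigma \frac{\mathcal{G}_n(u)}{u} \leq \sigma \frac{\mathcal{G}_n(\delta_n)}{\delta_n} \overset{(i)}{\leq} \frac{\delta_n}{2} \leq \delta_n,$$

where step (i) uses the critical condition (13.17). Putting together the pieces, we have shown
that $\mathbb{E}[Z_n(u)] \leq u\delta_n$. Combined with the tail bound (13.41), we obtain

$$\mathbb{P}[Z_n(u) \geq 2u^2] \overset{(ii)}{\leq} \mathbb{P}[Z_n(u) \geq u\delta_n + u^2] \leq e^{-\frac{n u^2}{2\sigma^2}},$$

where step (ii) uses the inequality $u^2 \geq u\delta_n$.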


13.3 Oracle inequalities 


In our analysis thus far, we have assumed that the regression function $f^*$ belongs to the
function class $\mathcal{F}$ over which the constrained least-squares estimator (13.7) is defined. In
practice, this assumption might be violated, but it is nonetheless of interest to obtain bounds
on the performance of the nonparametric least-squares estimator. In such settings, we expect
its performance to involve both the estimation error that arises in Theorem 13.5, and some
additional form of approximation error, arising from the fact that $f^* \notin \mathcal{F}$.

A natural way in which to measure approximation error is in terms of the best
approximation to $f^*$ using functions from $\mathcal{F}$. In the setting of interest in this chapter, the error in this
best approximation is given by $\inf_{f \in \mathcal{F}} \|f - f^*\|_n^2$. Note that this error can only be achieved
by an “oracle” that has direct access to the samples $\{f^*(x_i)\}_{i=1}^n$. For this reason, results that
involve this form of approximation error are referred to as oracle inequalities. With this
set-up, we have the following generalization of Theorem 13.5. As before, we assume that we
observe samples $\{(y_i, x_i)\}_{i=1}^n$ from the model $y_i = f^*(x_i) + \sigma w_i$, where $w_i \sim N(0, 1)$. The
reader should also recall the shorthand notation $\partial \mathcal{F} = \{ f_1 - f_2 \mid f_1, f_2 \in \mathcal{F} \}$. We assume
that this set is star-shaped; if not, it should be replaced by its star hull in the results to follow.


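Theorem 13.13 Let $\delta_n$ be any positive solution to the inequality

$$\frac{\mathcal{G}_n(\delta; \partial \mathcal{F})}{\delta} \leq \frac{\delta}{2\sigma}. \tag{13.42a}$$

There are universal positive constants $(c_0, c_1, c_2)$ such that for any $t \geq \delta_n$, the
nonparametric least-squares estimate $\hat{f}_n$ satisfies the bound

$$\|\hat{f} - f^*\|_n^2 \leq \inf_{\gamma \in (0, 1)} \Big\{ \frac{1 + \gamma}{1 - \gamma} \|f - f^*\|_n^2 + \frac{c_0}{\gamma (1 - \gamma)} \, t \delta_n \Big\} \quad \text{for all } f \in \mathcal{F} \tag{13.42b}$$

with probability greater than $1 - c_1 e^{-c_2 \frac{n t \delta_n}{\sigma^2}}$.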


Remarks: Note that the guarantee (13.42b) is actually a family of bounds, one for each
$f \in \mathcal{F}$. When $f^* \in \mathcal{F}$, then we can set $f = f^*$, so that the bound (13.42b) reduces to
asserting that $\|\hat{f} - f^*\|_n^2 \lesssim t\delta_n$ with high probability, where $\delta_n$ satisfies our previous critical
inequality (13.17). Thus, up to constant factors, we recover Theorem 13.5 as a special case
of Theorem 13.13. In the more general setting when $f^* \notin \mathcal{F}$, setting $t = \delta_n$ and taking the
infimum over $f \in \mathcal{F}$ yields an upper bound of the form

$$\|\hat{f} - f^*\|_n^2 \lesssim \inf_{f \in \mathcal{F}} \|f - f^*\|_n^2 + \delta_n^2. \tag{13.43a}$$

Similarly, by integrating the tail bound, we are guaranteed that

$$\mathbb{E}\big[ \|\hat{f} - f^*\|_n^2 \big] \lesssim \inf_{f \in \mathcal{F}} \|f - f^*\|_n^2 + \delta_n^2 + \frac{\sigma^2}{n}. \tag{13.43b}$$

These forms of the bound clarify the terminology oracle inequality: more precisely, the
quantity $\inf_{f \in \mathcal{F}} \|f - f^*\|_n^2$ is the error achievable only by an oracle that has access to
uncorrupted samples of the function $f^*$. The bound (13.43a) guarantees that the least-squares
estimate $\hat{f}$ has prediction error that is at most a constant multiple of the oracle error, plus a
term proportional to $\delta_n^2$. The term $\inf_{f \in \mathcal{F}} \|f - f^*\|_n^2$ can be viewed as a form of approximation
error that decreases as the function class $\mathcal{F}$ grows, whereas the term $\delta_n^2$ is the estimation
error that increases as $\mathcal{F}$ becomes more complex. This upper bound can thus be used to
choose $\mathcal{F}$ as a function of the sample size so as to obtain a desirable trade-off between the
two types of error. We will see specific instantiations of this procedure in the examples to
follow.
13.3.1 Some examples of oracle inequalities


Theorem 13.13 as well as oracle inequality (13.43a) are best understood by applying them 
to derive explicit rates for some particular examples. 


Example 13.14 (Orthogonal series expansion) Let (@,,)~_, be an orthonormal basis of 
L?(P), and for each integer T = 1,2,..., consider the function class 


Fali := { yee | Sasi < (13.44) 


m=1 m=1 


and let fbe the constrained least-squares estimate over this class. Its computation is straight- 
forward: it reduces to a version of linear ridge regression (see Exercise 13.10). 

Let us consider the guarantees of Theorem 13.13 for fas an estimate of some function 
f* in the unit ball of L?(P). Since (¢m)~_, is an orthonormal basis of L? (P), we have f* = 
Vi-1 Fm for some coefficient sequence (G;,)”_,. Moreover, by Parseval’s theorem, we have 
the equivalence || f* i= E OA < 1, and a retcaighiicrward calculation yields that 


o0 


Werk >) O foreach T = 1,2,.... 


inf 
EF otho(1;T 
FE Fonto (l;T) m=T+1 


T 


Moreover, this infimum is achieved by the truncated function fr = nat On 


cise 13.10 for more details. 

On the other hand, since the estimator over Yortno(1; T) corresponds to a form of ridge 
regression in dimension T, the coe ulatone from Example 13.8 imply that the critical equa- 
tion (13.42a) is satisfied by 62 =~ o° Z T, Setting f = fr in the oracle inequality (13.43b) and 
then taking expectations over the covariates X = {x;}"_, yields that the least-squares estimate 


f over F stho( 1; T) satisfies the bound 


& dm; see Exer- 


xwlllf- R] x 5 (6,7 +0? — (13.45) 
m=T+1 

This oracle inequality allows us to choose the parameter T, which indexes the number of 

coefficients used in our basis expansion, so as to balance the approximation and estimation 

errors. 

The optimal choice of T will depend on the rate at which the basis coefficients
$(\theta^*_m)_{m=1}^\infty$ decay to zero. For example, suppose that they exhibit a polynomial decay,
say $|\theta^*_m| \le C m^{-\alpha}$ for some $\alpha > 1/2$. In Example 13.15 to follow, we provide a
concrete instance of such polynomial decay using Fourier coefficients and
$\alpha$-times-differentiable functions. Figure 13.4(a) shows a plot of the upper bound (13.45) as a
function of T, with one curve for each of the sample sizes n ∈ {100, 250, 500, 1000}. The solid
markers within each curve show the point T* = T*(n) at which the upper bound is minimized, thereby
achieving the optimal trade-off between approximation and estimation errors. Note how this optimum
grows with the sample size, since more samples allow us to reliably estimate a larger number of
coefficients. ♣
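
The trade-off in the bound (13.45) can be explored numerically. The following is a minimal sketch, assuming the illustrative coefficient sequence $\theta^*_m = 1/m$ (truncated at m = 5000 so that the tail sum can be computed) and noise variance $\sigma^2 = 1$. The helper names `oracle_bound` and `best_T` are our own and do not come from the text.

```python
import numpy as np

def oracle_bound(T, n, coeff_sq, sigma2=1.0):
    """Right-hand side of (13.45): tail energy beyond index T plus sigma^2 * T / n."""
    tail = coeff_sq[T:].sum()      # sum over m > T (coeff_sq[0] corresponds to m = 1)
    return tail + sigma2 * T / n

# Assumed coefficient envelope for illustration: theta_m = 1/m for m = 1, ..., 5000.
m = np.arange(1, 5001)
coeff_sq = (1.0 / m) ** 2

def best_T(n, T_max=200):
    """Grid search for the truncation level T minimizing the bound."""
    Ts = np.arange(1, T_max + 1)
    vals = np.array([oracle_bound(T, n, coeff_sq) for T in Ts])
    return int(Ts[vals.argmin()])

t_star = {n: best_T(n) for n in (100, 250, 500, 1000)}
print(t_star)
```

Under these assumptions the tail sum behaves like 1/T, so the minimizer grows roughly like $\sqrt{n}$, in line with the qualitative behavior described for Figure 13.4(a).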


As a more concrete instantiation of the previous example, let us consider the approximation
of differentiable functions over the space $L^2[0, 1]$.

Figure 13.4 Plot of upper bound (13.45) versus the model dimension T, in all cases
with noise variance $\sigma^2 = 1$. Each of the four curves corresponds to a different sample
size n ∈ {100, 250, 500, 1000}. (a) Polynomial decaying coefficients $|\theta^*_m| \le m^{-1}$. (b)
Exponential decaying coefficients $|\theta^*_m| \le e^{-m/2}$. Both panels are titled "Error vs dimension".


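Example 13.15 (Fourier bases and differentiable functions) Define the constant function
$\phi_0(x) = 1$ for all $x \in [0,1]$, and the sinusoidal functions

$$\phi_m(x) := \sqrt{2}\cos(2\pi m x) \quad \text{and} \quad \tilde{\phi}_m(x) := \sqrt{2}\sin(2\pi m x) \qquad \text{for } m = 1, 2, \ldots.$$

It can be verified that the collection
$\{\phi_0\} \cup \{\phi_m\}_{m=1}^\infty \cup \{\tilde{\phi}_m\}_{m=1}^\infty$ forms an orthonormal basis of
$L^2[0, 1]$. Consequently, any function $f^* \in L^2[0, 1]$ has the series expansion

$$f^* = \theta^*_0 + \sum_{m=1}^\infty \big\{\theta^*_m \phi_m + \tilde{\theta}^*_m \tilde{\phi}_m\big\}.$$

For each M = 1, 2, ..., define the function class

$$\mathcal{G}(1; M) := \Big\{\beta_0 + \sum_{m=1}^M \big(\beta_m \phi_m + \tilde{\beta}_m \tilde{\phi}_m\big) \;\Big|\; \beta_0^2 + \sum_{m=1}^M \big(\beta_m^2 + \tilde{\beta}_m^2\big) \le 1\Big\}. \tag{13.46}$$

Note that this is simply a re-indexing of a function class $\mathcal{F}_{\mathrm{ortho}}(1;T)$ of the form (13.44)
with T = 2M + 1.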

Now suppose that for some integer $\alpha \ge 1$, the target function $f^*$ is $\alpha$-times differentiable,
and suppose that $\int_0^1 [(f^*)^{(\alpha)}(x)]^2 \, dx \le R$ for some radius R. It can be verified that
there is a constant c such that $(\theta^*_m)^2 + (\tilde{\theta}^*_m)^2 \le \frac{cR}{m^{2\alpha}}$ for all
$m \ge 1$, and, moreover, we can find a function $f \in \mathcal{G}(1; M)$ such that

$$\|f - f^*\|_2^2 \le \frac{c' R}{M^{2\alpha}}. \tag{13.47}$$

See Exercise 13.11 for details on these properties.
Putting together the pieces, the bound (13.45) combined with the approximation-theoretic
guarantee (13.47) implies that the least-squares estimate $\hat{f}_M$ over $\mathcal{G}(1; M)$ satisfies the bound

$$\mathbb{E}_{X,w}\big[\|\hat{f}_M - f^*\|_n^2\big] \lesssim \frac{1}{M^{2\alpha}} + \sigma^2 \frac{(2M + 1)}{n}.$$

Thus, for a given sample size n and assuming knowledge of the smoothness $\alpha$ and noise
variance $\sigma^2$, we can choose $M = M(n, \alpha, \sigma^2)$ so as to balance the approximation and
estimation error terms. A little algebra shows that the optimal choice is
$M \asymp (n/\sigma^2)^{\frac{1}{2\alpha+1}}$, which leads to the overall rate

$$\mathbb{E}_{X,w}\big[\|\hat{f}_M - f^*\|_n^2\big] \lesssim \Big(\frac{\sigma^2}{n}\Big)^{\frac{2\alpha}{2\alpha+1}}.$$

As will be clarified in Chapter 15, this $n^{-\frac{2\alpha}{2\alpha+1}}$ decay in mean-squared error is the best
that can be expected for general univariate $\alpha$-smooth functions. ♣
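
The balancing step can also be checked numerically. The sketch below assumes R = 1 and absorbs all constants, so that the bound reads $M^{-2\alpha} + \sigma^2 (2M+1)/n$; the function names `fourier_bound` and `best_M` are hypothetical. It compares a grid-search minimizer with the predicted scaling $n^{1/(2\alpha+1)}$.

```python
import numpy as np

def fourier_bound(M, n, alpha, sigma2=1.0):
    """Approximation term M^(-2*alpha) plus estimation term sigma^2 * (2M + 1) / n."""
    return M ** (-2.0 * alpha) + sigma2 * (2 * M + 1) / n

def best_M(n, alpha, sigma2=1.0, M_max=2000):
    """Grid search over M = 1, ..., M_max for the minimizer of the bound."""
    Ms = np.arange(1, M_max + 1, dtype=float)
    return int(Ms[np.argmin(fourier_bound(Ms, n, alpha, sigma2))])

alpha = 2                                  # assumed smoothness level
sample_sizes = (10**3, 10**4, 10**5, 10**6)
M_opt = {n: best_M(n, alpha) for n in sample_sizes}
for n in sample_sizes:
    # Compare the grid minimizer with the predicted scaling n^(1/(2*alpha+1)).
    print(n, M_opt[n], n ** (1.0 / (2 * alpha + 1)))
```

The ratio of the two printed quantities should stay bounded as n grows, which is what the relation $M \asymp (n/\sigma^2)^{1/(2\alpha+1)}$ asserts.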


We now turn to the use of oracle inequalities in high-dimensional sparse linear regression. 


Example 13.16 (Best sparse approximation) Consider the standard linear model
$y_i = f_{\theta^*}(x_i) + \sigma w_i$, where $f_{\theta^*}(x) := \langle \theta^*, x \rangle$ is an unknown linear
regression function, and $w_i \sim N(0, 1)$ is an i.i.d. noise sequence. For some sparsity index
$s \in \{1, 2, \ldots, d\}$, consider the class of all linear regression functions based on s-sparse
vectors, namely the class

$$\mathcal{F}_{\mathrm{spar}}(s) := \big\{ f_\theta \mid \theta \in \mathbb{R}^d, \; \|\theta\|_0 \le s \big\},$$

where $\|\theta\|_0 = \sum_{j=1}^d \mathbb{I}[\theta_j \ne 0]$ counts the number of non-zero coefficients in the
vector $\theta \in \mathbb{R}^d$. Disregarding computational considerations, a natural estimator is given by

$$\hat{\theta} \in \arg\min_{f_\theta \in \mathcal{F}_{\mathrm{spar}}(s)} \|y - X\theta\|_2^2, \tag{13.48}$$

corresponding to performing least squares over the set of all regression vectors with at most
s non-zero coefficients. As a corollary of Theorem 13.13, we claim that the $L^2(\mathbb{P}_n)$-error of
this estimator is upper bounded as

$$\|f_{\hat{\theta}} - f_{\theta^*}\|_n^2 \lesssim \inf_{\theta \in \mathcal{F}_{\mathrm{spar}}(s)} \|f_\theta - f_{\theta^*}\|_n^2 + \underbrace{\sigma^2 \frac{s \log(\frac{ed}{s})}{n}}_{\delta_n^2} \tag{13.49}$$

with high probability. Consequently, up to constant factors, its error is as good as the best
s-sparse predictor plus the penalty term $\delta_n^2$, arising from the estimation error. Note that the
penalty term grows linearly with the sparsity s, but only logarithmically in the dimension d,
so that it can be very small even when the dimension is exponentially larger than the sample
size n. In essence, this result guarantees that we pay a relatively small price for not knowing
in advance the best s-sized subset of coefficients to use.

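In order to derive this result as a corollary of Theorem 13.13, we need to compute the
local Gaussian complexity (13.42a) for our function class. Making note of the inclusion
$\partial\mathcal{F}_{\mathrm{spar}}(s) \subset \mathcal{F}_{\mathrm{spar}}(2s)$, we have
$G_n(\delta; \partial\mathcal{F}_{\mathrm{spar}}(s)) \le G_n(\delta; \mathcal{F}_{\mathrm{spar}}(2s))$. Now let
$S \subset \{1, 2, \ldots, d\}$ be an arbitrary 2s-sized subset of indices, and let
$X_S \in \mathbb{R}^{n \times 2s}$ denote the submatrix with columns indexed by S. We can then write

$$G_n(\delta; \mathcal{F}_{\mathrm{spar}}(2s)) = \mathbb{E}_w\Big[\max_{|S| = 2s} Z_n(S)\Big], \quad \text{where} \quad Z_n(S) := \sup_{\substack{\theta_S \in \mathbb{R}^{2s} \\ \|X_S \theta_S\|_2/\sqrt{n} \le \delta}} \Big|\frac{w^T X_S \theta_S}{n}\Big|.$$

Viewed as a function of the standard Gaussian vector w, the variable $Z_n(S)$ is Lipschitz with
constant at most $\frac{\delta}{\sqrt{n}}$, from which Theorem 2.26 implies the tail bound

$$\mathbb{P}\big[Z_n(S) \ge \mathbb{E}[Z_n(S)] + t\delta\big] \le e^{-\frac{n t^2}{2}} \qquad \text{for all } t > 0. \tag{13.50}$$

We now upper bound the expectation. Consider the singular value decomposition
$X_S = U D V^T$, where $U \in \mathbb{R}^{n \times 2s}$ and $V \in \mathbb{R}^{2s \times 2s}$ are matrices of left and right
singular vectors, respectively, and $D \in \mathbb{R}^{2s \times 2s}$ is a diagonal matrix of the singular values.
Noting that $\|X_S \theta_S\|_2 = \|D V^T \theta_S\|_2$, we arrive at the upper bound

$$\mathbb{E}[Z_n(S)] \le \mathbb{E}\Big[\sup_{\substack{\beta \in \mathbb{R}^{2s} \\ \|\beta\|_2 \le \delta}} \Big|\frac{1}{\sqrt{n}} \langle U^T w, \beta \rangle\Big|\Big] \le \frac{\delta}{\sqrt{n}} \, \mathbb{E}\big[\|U^T w\|_2\big].$$

Since $w \sim N(0, I_n)$ and the matrix U has orthonormal columns, we have $U^T w \sim N(0, I_{2s})$,
and therefore $\mathbb{E}\|U^T w\|_2 \le \sqrt{2s}$. Combining this upper bound with the earlier tail bound
(13.50), an application of the union bound yields

$$\mathbb{P}\Big[\max_{|S| = 2s} Z_n(S) \ge \delta\Big(\sqrt{\frac{2s}{n}} + t\Big)\Big] \le \binom{d}{2s} e^{-\frac{n t^2}{2}}, \qquad \text{valid for all } t \ge 0.$$

By integrating this tail bound, we find that

$$\frac{\mathbb{E}\big[\max_{|S| = 2s} Z_n(S)\big]}{\delta} = \frac{G_n(\delta)}{\delta} \lesssim \sqrt{\frac{s}{n}} + \sqrt{\frac{\log \binom{d}{2s}}{n}} \lesssim \sqrt{\frac{s \log(ed/s)}{n}},$$

so that the critical inequality (13.17) is satisfied for $\delta_n^2 \asymp \sigma^2 \frac{s \log(ed/s)}{n}$, as claimed. ♣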
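
To make the estimator (13.48) concrete, here is a brute-force sketch that enumerates all supports of size s and solves an ordinary least-squares problem on each. It is only feasible for small d, and the synthetic data, the dimensions and the function name `best_subset_least_squares` are illustrative assumptions. The script also evaluates the penalty term $\sigma^2 s \log(ed/s)/n$ from (13.49) for comparison.

```python
import itertools
import numpy as np

def best_subset_least_squares(X, y, s):
    """Exhaustive search over all size-s supports (these cover every s-sparse vector)."""
    n, d = X.shape
    best_rss, best_theta = np.inf, np.zeros(d)
    for S in itertools.combinations(range(d), s):
        cols = list(S)
        theta_S, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = np.sum((y - X[:, cols] @ theta_S) ** 2)
        if rss < best_rss:
            best_rss = rss
            best_theta = np.zeros(d)
            best_theta[cols] = theta_S
    return best_theta

# Synthetic, well-specified problem (all values below are illustrative).
rng = np.random.default_rng(0)
n, d, s, sigma = 200, 12, 2, 0.5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[[1, 7]] = [2.0, -1.5]
y = X @ theta_star + sigma * rng.standard_normal(n)

theta_hat = best_subset_least_squares(X, y, s)
err = np.mean((X @ (theta_hat - theta_star)) ** 2)     # squared L2(P_n) error
penalty = sigma**2 * s * np.log(np.e * d / s) / n      # penalty term in (13.49)
print(err, penalty)
```

The search visits $\binom{d}{s}$ supports, which is why the text sets computational considerations aside when introducing this estimator.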


13.3.2 Proof of Theorem 13.13 


We now turn to the proof of our oracle inequality; it is a relatively straightforward extension
of the proof of Theorem 13.5. Given an arbitrary $\tilde{f} \in \mathcal{F}$, since it is feasible and $\hat{f}$ is
optimal, we have

$$\frac{1}{2n} \sum_{i=1}^n \big(y_i - \hat{f}(x_i)\big)^2 \le \frac{1}{2n} \sum_{i=1}^n \big(y_i - \tilde{f}(x_i)\big)^2.$$

Using the relation $y_i = f^*(x_i) + \sigma w_i$, some algebra then yields

$$\frac{1}{2} \|\hat{\Delta}\|_n^2 \le \frac{1}{2} \|\tilde{f} - f^*\|_n^2 + \Big|\frac{\sigma}{n} \sum_{i=1}^n w_i \tilde{\Delta}(x_i)\Big|, \tag{13.51}$$

where we have defined $\hat{\Delta} := \hat{f} - f^*$ and $\tilde{\Delta} := \hat{f} - \tilde{f}$.
It remains to analyze the term on the right-hand side involving $\tilde{\Delta}$. We break our analysis
into two cases.


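Case 1: First suppose that $\|\tilde{\Delta}\|_n \le \sqrt{t\delta_n}$. We then have

$$\begin{aligned}
\|\hat{\Delta}\|_n^2 = \|\hat{f} - f^*\|_n^2 &= \|(\tilde{f} - f^*) + \tilde{\Delta}\|_n^2 \\
&\overset{(i)}{\le} \big\{\|\tilde{f} - f^*\|_n + \sqrt{t\delta_n}\big\}^2 \\
&\overset{(ii)}{\le} (1 + 2\beta) \|\tilde{f} - f^*\|_n^2 + \Big(1 + \frac{2}{\beta}\Big) t\delta_n,
\end{aligned}$$

where step (i) follows from the triangle inequality, and step (ii) is valid for any $\beta > 0$, using
the Fenchel-Young inequality. Now setting $\beta = \frac{\gamma}{1-\gamma}$ for some $\gamma \in (0, 1)$, observe that
$1 + 2\beta = \frac{1+\gamma}{1-\gamma}$, and $1 + \frac{2}{\beta} = \frac{2-\gamma}{\gamma} \le \frac{2}{\gamma(1-\gamma)}$, so that the
stated claim (13.42b) follows.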


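Case 2: Otherwise, we may assume that $\|\tilde{\Delta}\|_n > \sqrt{t\delta_n}$. Noting that the function
$\tilde{\Delta}$ belongs to the difference class $\partial\mathcal{F} := \mathcal{F} - \mathcal{F}$, we then apply Lemma 13.12 with
$u = \sqrt{t\delta_n}$ and $\mathcal{H} = \partial\mathcal{F}$. Doing so yields that

$$\mathbb{P}\Big[2\Big|\frac{\sigma}{n} \sum_{i=1}^n w_i \tilde{\Delta}(x_i)\Big| \ge 4\sqrt{t\delta_n}\, \|\tilde{\Delta}\|_n\Big] \le e^{-\frac{n t \delta_n}{2\sigma^2}}.$$

Combining with the basic inequality (13.51), we find that, with probability at least
$1 - e^{-\frac{n t \delta_n}{2\sigma^2}}$, the squared error is bounded as

$$\begin{aligned}
\|\hat{\Delta}\|_n^2 &\le \|\tilde{f} - f^*\|_n^2 + 4\sqrt{t\delta_n}\, \|\tilde{\Delta}\|_n \\
&\le \|\tilde{f} - f^*\|_n^2 + 4\sqrt{t\delta_n}\, \big\{\|\hat{\Delta}\|_n + \|\tilde{f} - f^*\|_n\big\},
\end{aligned}$$

where the second step follows from the triangle inequality. Applying the Fenchel-Young
inequality with parameter $\beta > 0$, we find that

$$4\sqrt{t\delta_n}\, \|\hat{\Delta}\|_n \le 4\beta \|\hat{\Delta}\|_n^2 + \frac{4}{\beta} t\delta_n \quad \text{and} \quad 4\sqrt{t\delta_n}\, \|\tilde{f} - f^*\|_n \le 4\beta \|\tilde{f} - f^*\|_n^2 + \frac{4}{\beta} t\delta_n.$$

Combining the pieces yields

$$\|\hat{\Delta}\|_n^2 \le (1 + 4\beta) \|\tilde{f} - f^*\|_n^2 + 4\beta \|\hat{\Delta}\|_n^2 + \frac{8}{\beta} t\delta_n.$$

For all $\beta \in (0, 1/4)$, rearranging yields the bound

$$\|\hat{\Delta}\|_n^2 \le \frac{1 + 4\beta}{1 - 4\beta} \|\tilde{f} - f^*\|_n^2 + \frac{8}{\beta(1 - 4\beta)} t\delta_n.$$

Setting $\gamma = 4\beta$ yields the claim.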


13.4 Regularized estimators


Up to this point, we have analyzed least-squares estimators based on imposing explicit con- 
straints on the function class. From the computational point of view, it is often more conve- 
nient to implement estimators based on explicit penalization or regularization terms. As we 
will see, these estimators enjoy statistical behavior similar to their constrained analogs. 

More formally, given a space $\mathcal{F}$ of real-valued functions with an associated semi-norm
$\|\cdot\|_{\mathcal{F}}$, consider the family of regularized least-squares problems

$$\hat{f} \in \arg\min_{f \in \mathcal{F}} \Big\{\frac{1}{2n} \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda_n \|f\|_{\mathcal{F}}^2\Big\}, \tag{13.52}$$

where $\lambda_n \ge 0$ is a regularization weight to be chosen by the statistician. We state a general
oracle-type result that does not require $f^*$ to be a member of $\mathcal{F}$.


13.4.1 Oracle inequalities for regularized estimators 


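Recall the compact notation $\partial\mathcal{F} = \mathcal{F} - \mathcal{F}$. As in our previous theory, the statistical error
involves a local Gaussian complexity over this class, which in this case takes the form

$$G_n\big(\delta; \mathbb{B}_{\partial\mathcal{F}}(3)\big) := \mathbb{E}_w\Big[\sup_{\substack{g \in \partial\mathcal{F} \\ \|g\|_{\mathcal{F}} \le 3, \; \|g\|_n \le \delta}} \Big|\frac{1}{n} \sum_{i=1}^n w_i g(x_i)\Big|\Big], \tag{13.53}$$

where $w_i \sim N(0, 1)$ are i.i.d. variates. When the function class $\mathcal{F}$ and rescaled ball
$\mathbb{B}_{\partial\mathcal{F}}(3) = \{g \in \partial\mathcal{F} \mid \|g\|_{\mathcal{F}} \le 3\}$ are clear from the context, we adopt
$G_n(\delta)$ as a convenient shorthand. For a user-defined radius R > 0, we let $\delta_n > 0$ be any
number satisfying the inequality

$$\frac{G_n(\delta)}{\delta} \le \frac{R}{2\sigma} \, \delta. \tag{13.54}$$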


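Theorem 13.17 Given the previously described observation model and a convex function
class $\mathcal{F}$, suppose that we solve the convex program (13.52) with some regularization
parameter $\lambda_n \ge 2\delta_n^2$. Then there are universal positive constants $(c_j, c'_j)$ such that

$$\|\hat{f} - f^*\|_n^2 \le c_0 \inf_{\|f\|_{\mathcal{F}} \le R} \|f - f^*\|_n^2 + c_1 R^2 \big\{\delta_n^2 + \lambda_n\big\} \tag{13.55a}$$

with probability greater than $1 - c_2 e^{-c_3 \frac{n R^2 \delta_n^2}{\sigma^2}}$. Similarly, we have

$$\mathbb{E}\|\hat{f} - f^*\|_n^2 \le c'_0 \inf_{\|f\|_{\mathcal{F}} \le R} \|f - f^*\|_n^2 + c'_1 R^2 \big\{\delta_n^2 + \lambda_n\big\}. \tag{13.55b}$$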


We return to prove this claim in Section 13.4.4. 


13.4.2 Consequences for kernel ridge regression 


Recall from Chapter 12 our discussion of the kernel ridge regression estimate (12.28). There
we showed that this KRR estimate has attractive computational properties, in that it only
requires computing the empirical kernel matrix, and then solving a linear system (see
Proposition 12.33). Here we turn to the complementary question of understanding its statistical
behavior. Since it is a special case of the general estimator (13.52), Theorem 13.17 can be
used to derive upper bounds on the prediction error. Interestingly, these bounds have a very
intuitive interpretation, one involving the eigenvalues of the empirical kernel matrix.
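
For concreteness, here is a sketch of the linear-system computation for the program (13.52) when $\mathcal{F}$ is an RKHS. It assumes the parameterization $f = n^{-1/2} \sum_i \alpha_i K(\cdot, x_i)$ and the normalized kernel matrix $K_{ij} = K(x_i, x_j)/n$; setting the gradient of the resulting quadratic objective to zero gives $(K + 2\lambda_n I)\alpha = y/\sqrt{n}$. The factor 2 reflects the normalization of (13.52) as written here and may differ under other conventions. The function names and the synthetic data are our own.

```python
import numpy as np

def min_kernel(x, z):
    """First-order Sobolev kernel K(x, z) = min{x, z}, used here only as an example."""
    return np.minimum(x, z)

def krr_fit(x, y, kernel, lam):
    """Solve (K + 2*lam*I) alpha = y / sqrt(n), with K_ij = kernel(x_i, x_j) / n."""
    n = len(x)
    K = kernel(x[:, None], x[None, :]) / n
    alpha = np.linalg.solve(K + 2.0 * lam * np.eye(n), y / np.sqrt(n))
    return alpha, K

def krr_predict(x_new, x, alpha, kernel):
    """Evaluate f(x_new) = n^(-1/2) * sum_i alpha_i * kernel(x_new, x_i)."""
    return kernel(x_new[:, None], x[None, :]) @ alpha / np.sqrt(len(x))

rng = np.random.default_rng(0)
n, lam = 100, 1e-3
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)   # synthetic responses
alpha, K = krr_fit(x, y, min_kernel, lam)
fitted = krr_predict(x, x, alpha, min_kernel)
```

The only expensive steps are forming the n-by-n kernel matrix and one linear solve, which is the computational property recalled above.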

From our earlier definition, the (rescaled) empirical kernel matrix $K \in \mathbb{R}^{n \times n}$ is symmetric
and positive semidefinite, with entries of the form $K_{ij} = \mathcal{K}(x_i, x_j)/n$. It is thus
diagonalizable with non-negative eigenvalues, which we take to be ordered as
$\hat{\mu}_1 \ge \hat{\mu}_2 \ge \cdots \ge \hat{\mu}_n \ge 0$. The following corollary of Theorem 13.17 provides
bounds on the performance of the kernel ridge regression estimate in terms of these eigenvalues:


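Corollary 13.18 For the KRR estimate (12.28), the bounds of Theorem 13.17 hold for
any $\delta_n > 0$ satisfying the inequality

$$\sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}} \le \frac{R}{4\sigma} \, \delta^2. \tag{13.56}$$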


We provide the proof in Section 13.4.3. Before doing so, let us examine the implications of 
Corollary 13.18 for some specific choices of kernels. 
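
Before turning to specific kernels, the following sketch shows one way to solve the critical inequality (13.56) numerically from a vector of empirical eigenvalues. It uses bisection on a logarithmic scale, which is justified because the ratio of the left-hand side to $\delta^2$ is decreasing in $\delta$. The function names, the bracket [lo, hi], and the choice of the first-order Sobolev kernel with R = σ = 1 are illustrative assumptions.

```python
import numpy as np

def kernel_complexity(delta, mu_hat):
    """Left-hand side of (13.56): sqrt(2/n) * sqrt(sum_j min{delta^2, mu_j})."""
    n = len(mu_hat)
    return np.sqrt(2.0 / n) * np.sqrt(np.sum(np.minimum(delta**2, mu_hat)))

def satisfies_critical(delta, mu_hat, R, sigma):
    """Check the inequality (13.56) at a given delta."""
    return kernel_complexity(delta, mu_hat) <= R / (4.0 * sigma) * delta**2

def critical_radius(mu_hat, R=1.0, sigma=1.0, lo=1e-8, hi=1e3, iters=200):
    """Bisection on a log scale for the smallest delta in [lo, hi] satisfying (13.56)."""
    if not satisfies_critical(hi, mu_hat, R, sigma):
        raise ValueError("upper bracket hi does not satisfy the inequality")
    for _ in range(iters):
        mid = np.sqrt(lo * hi)              # geometric midpoint
        if satisfies_critical(mid, mu_hat, R, sigma):
            hi = mid
        else:
            lo = mid
    return hi

# Example: eigenvalues of the normalized matrix for K(x, z) = min{x, z}.
rng = np.random.default_rng(0)
n = 300
x = rng.uniform(0.0, 1.0, n)
K = np.minimum.outer(x, x) / n
mu_hat = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # clip round-off negatives
delta_n = critical_radius(mu_hat, R=1.0, sigma=1.0)
print(delta_n)
```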


Example 13.19 (Rates for polynomial regression) Given some integer $m \ge 2$, consider the
kernel function $\mathcal{K}(x, z) = (1 + xz)^{m-1}$. The associated RKHS corresponds to the space of
all polynomials of degree at most m − 1, which is a vector space with dimension m.
Consequently, the empirical kernel matrix $K \in \mathbb{R}^{n \times n}$ can have rank at most min{n, m}. Therefore,
for any sample size n larger than m, we have

$$\frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}} \le \frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^m \min\{\delta^2, \hat{\mu}_j\}} \le \delta \sqrt{\frac{m}{n}}.$$

Consequently, the critical inequality (13.56) is satisfied for all $\delta \gtrsim \frac{\sigma}{R} \sqrt{\frac{m}{n}}$, so that
the KRR estimate satisfies the bound

$$\|\hat{f} - f^*\|_n^2 \lesssim \inf_{\|f\|_{\mathbb{H}} \le R} \|f - f^*\|_n^2 + \sigma^2 \frac{m}{n},$$

both in high probability and in expectation. This bound is intuitively reasonable: since the
space of degree m − 1 polynomials has a total of m free parameters, we expect that the ratio m/n
should converge to zero in order for consistent estimation to be possible. More generally,
this same bound with m = r holds for any kernel function that has some finite rank $r \ge 1$. ♣
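
The rank argument can be checked numerically. The sketch below, with an assumed design of n = 50 points drawn uniformly on [−1, 1] and m = 4, forms the normalized kernel matrix, computes its numerical rank, and compares the two sides of the displayed eigenvalue bound for a few values of $\delta$. All concrete values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4                                  # sample size and kernel parameter
x = rng.uniform(-1.0, 1.0, n)                 # assumed design points
K = (1.0 + np.outer(x, x)) ** (m - 1) / n     # normalized matrix for (1 + xz)^(m-1)

# Eigenvalues sorted from largest to smallest, with round-off negatives clipped.
mu_hat = np.clip(np.linalg.eigvalsh(K)[::-1], 0.0, None)
rank = np.linalg.matrix_rank(K)

def lhs(delta):
    """(1/sqrt(n)) * sqrt(sum_j min{delta^2, mu_j})."""
    return np.sqrt(np.sum(np.minimum(delta**2, mu_hat)) / n)

deltas = (0.01, 0.1, 1.0)
for delta in deltas:
    print(delta, lhs(delta), delta * np.sqrt(m / n))
print("numerical rank:", rank)
```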


We now turn to a kernel function with an infinite number of eigenvalues: 


Example 13.20 (First-order Sobolev space) Previously, we introduced the kernel function 
K(x, z) = min{x, z} defined on the unit square [0, 1] x [0, 1]. As discussed in Example 12.16, 
the associated RKHS corresponds to a first-order Sobolev space

$$\mathbb{H}^1[0, 1] := \big\{ f : [0, 1] \to \mathbb{R} \mid f(0) = 0, \text{ and } f \text{ is abs. cts. with } f' \in L^2[0, 1] \big\}.$$

As shown in Example 12.23, the kernel integral operator associated with this space has the
eigendecomposition

$$\phi_j(x) = \sin\big(x/\sqrt{\mu_j}\big), \qquad \mu_j = \Big(\frac{2}{(2j - 1)\pi}\Big)^2 \qquad \text{for } j = 1, 2, \ldots,$$

so that the eigenvalues drop off at the rate $j^{-2}$. As the sample size increases, the eigenvalues
of the empirical kernel matrix K approach those of the population kernel operator. For the
purposes of calculation, Figure 13.5(a) suggests the heuristic of assuming that
$\hat{\mu}_j \le \frac{c}{j^2}$ for some universal constant c. Our later analysis in Chapter 14 will provide a
rigorous way of making such an argument.³

Figure 13.5 Log-log behavior of the eigenspectrum of the empirical kernel matrix
based on n = 2000 samples drawn i.i.d. from the uniform distribution over the interval
$\mathcal{X}$ for two different kernel functions. The plotted circles correspond to empirical
eigenvalues, whereas the dashed line shows the theoretically predicted drop-off of
the population operator. (a) The first-order Sobolev kernel $\mathcal{K}(x, z) = \min\{x, z\}$ on the
interval $\mathcal{X} = [0, 1]$. (b) The Gaussian kernel $\mathcal{K}(x, z) = \exp\big(-\frac{(x - z)^2}{2\sigma^2}\big)$ with
$\sigma = 0.5$ on the interval $\mathcal{X} = [-1, 1]$. Both panels are titled "Empirical eigenspectrum (random)".

Under our heuristic assumption, we have

$$\frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}} \le \frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, c j^{-2}\}} \le \frac{1}{\sqrt{n}} \sqrt{k\delta^2 + c \sum_{j=k+1}^n j^{-2}},$$

where k is the smallest positive integer such that $c k^{-2} \le \delta^2$. Upper bounding the final sum
by an integral, we have $c \sum_{j=k+1}^n j^{-2} \le c \int_k^\infty t^{-2} \, dt \le c k^{-1} \le k\delta^2$, and hence

$$\frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}} \le c' \sqrt{\frac{k}{n}} \, \delta \le c'' \sqrt{\frac{\delta}{n}}.$$

³ In particular, Proposition 14.25 shows that the critical radii computed using the population and empirical
kernel eigenvalues are equivalent up to constant factors.

Consequently, the critical inequality (13.56) is satisfied by $\delta_n^{3/2} \asymp \frac{\sigma}{R\sqrt{n}}$, or equivalently
$\delta_n^2 \asymp \big(\frac{\sigma^2}{R^2 n}\big)^{2/3}$. Putting together the pieces, Corollary 13.18 implies that the
KRR estimate will satisfy the upper bound

$$\|\hat{f} - f^*\|_n^2 \lesssim \inf_{\|f\|_{\mathbb{H}} \le R} \|f - f^*\|_n^2 + R^2 \delta_n^2 \asymp \inf_{\|f\|_{\mathbb{H}} \le R} \|f - f^*\|_n^2 + R^{2/3} \Big(\frac{\sigma^2}{n}\Big)^{2/3},$$

both with high probability and in expectation. As will be seen later in Chapter 15, this rate
is minimax-optimal for the first-order Sobolev space. ♣
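
The heuristic that the empirical eigenvalues track the population eigenvalues can be examined directly. The following sketch assumes n = 1000 design points drawn uniformly on [0, 1] (a smaller sample than in Figure 13.5, chosen to keep the eigendecomposition fast) and compares the leading eigenvalues of the normalized kernel matrix with the values $\mu_j = \big(\frac{2}{(2j-1)\pi}\big)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # assumed sample size for this illustration
x = rng.uniform(0.0, 1.0, n)
K = np.minimum.outer(x, x) / n             # K_ij = min{x_i, x_j} / n

mu_hat = np.linalg.eigvalsh(K)[::-1]       # empirical eigenvalues, largest first
j = np.arange(1, 11)
mu_pop = (2.0 / ((2 * j - 1) * np.pi)) ** 2   # population eigenvalues

for jj, emp, pop in zip(j, mu_hat[:10], mu_pop):
    print(jj, emp, pop)
```

With a sample of this size, the leading empirical and population eigenvalues should agree up to sampling fluctuations, which is the behavior the heuristic $\hat{\mu}_j \le c/j^2$ relies on.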


Example 13.21 (Gaussian kernel) Now let us consider the same issues for the Gaussian
kernel $\mathcal{K}(x, z) = e^{-\frac{(x - z)^2}{2\sigma^2}}$ on the square [−1, 1] × [−1, 1]. As discussed in Example 12.25, the
eigenvalues of the associated kernel operator scale as $\mu_j \asymp e^{-c j \log j}$ as $j \to +\infty$.
Accordingly, let us adopt the heuristic that the empirical eigenvalues satisfy a bound of the form
$\hat{\mu}_j \le c_0 e^{-c_1 j \log j}$. Figure 13.5(b) provides empirical justification of this scaling for the
Gaussian kernel: notice how the empirical plots on the log-log scale agree qualitatively with the
theoretical prediction. Again, Proposition 14.25 in Chapter 14 allows us to make a rigorous
argument that reaches the conclusion sketched here.
Under our heuristic assumption, for a given $\delta > 0$, we have

$$\frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}} \le \frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, c_0 e^{-c_1 j \log j}\}} \le \frac{1}{\sqrt{n}} \sqrt{k\delta^2 + c_0 \sum_{j=k+1}^n e^{-c_1 j \log j}},$$

where k is the smallest positive integer such that $c_0 e^{-c_1 k \log k} \le \delta^2$.

Some algebra shows that the critical inequality will be satisfied by
$\delta_n^2 \asymp \frac{\sigma^2}{R^2} \frac{\log(\frac{Rn}{\sigma})}{n}$, so that nonparametric regression over the Gaussian
kernel class satisfies the bound

$$\|\hat{f} - f^*\|_n^2 \lesssim \inf_{\|f\|_{\mathbb{H}} \le R} \|f - f^*\|_n^2 + R^2 \delta_n^2 = \inf_{\|f\|_{\mathbb{H}} \le R} \|f - f^*\|_n^2 + c \, \sigma^2 \frac{\log(\frac{Rn}{\sigma})}{n}$$

for some universal constant c. The estimation error component of this upper bound is very
fast, within a logarithmic factor of the $n^{-1}$ parametric rate, thereby revealing that the Gaussian
kernel class is much smaller than the first-order Sobolev space from Example 13.20.
However, the trade-off is that the approximation error decays very slowly as a function of
the radius R. See the bibliographic section for further discussion of this important trade-off. ♣
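
The claim that the Gaussian kernel class is much smaller can be illustrated by counting how many empirical eigenvalues exceed a threshold $\delta^2$; this count plays the role of the integer k in the calculations above. The sketch below is only an illustration, with an assumed threshold $\delta^2 = 10^{-4}$, n = 500 samples, and the same kernels and intervals as in Figure 13.5. The variable and function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta_sq = 500, 1e-4                    # assumed sample size and threshold

# First-order Sobolev kernel on [0, 1].
x_sob = rng.uniform(0.0, 1.0, n)
K_sob = np.minimum.outer(x_sob, x_sob) / n

# Gaussian kernel with bandwidth 0.5 on [-1, 1].
bandwidth = 0.5
x_gau = rng.uniform(-1.0, 1.0, n)
K_gau = np.exp(-np.subtract.outer(x_gau, x_gau) ** 2 / (2 * bandwidth**2)) / n

def count_above(K, threshold):
    """Number of eigenvalues of the symmetric matrix K exceeding the threshold."""
    return int(np.sum(np.linalg.eigvalsh(K) > threshold))

k_sob = count_above(K_sob, delta_sq)
k_gau = count_above(K_gau, delta_sq)
print("Sobolev:", k_sob, "Gaussian:", k_gau)
```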


13.4.3 Proof of Corollary 13.18


The proof of this corollary is based on a bound on the local Gaussian complexity (13.53) of 
the unit ball of an RKHS. Since it is of independent interest, let us state it as a separate result: 


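Lemma 13.22 Consider an RKHS with kernel function $\mathcal{K}$. For a given set of design
points $\{x_i\}_{i=1}^n$, let $\hat{\mu}_1 \ge \hat{\mu}_2 \ge \cdots \ge \hat{\mu}_n \ge 0$ be the eigenvalues of the normalized kernel
matrix K with entries $K_{ij} = \mathcal{K}(x_i, x_j)/n$. Then for all $\delta > 0$, we have

$$\mathbb{E}\Big[\sup_{\substack{\|f\|_{\mathbb{H}} \le 1 \\ \|f\|_n \le \delta}} \Big|\frac{1}{n} \sum_{i=1}^n w_i f(x_i)\Big|\Big] \le \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}}, \tag{13.57}$$

where $w_i \sim N(0, 1)$ are i.i.d. Gaussian variates.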

Proof It suffices to restrict our attention to functions of the form

$$g(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \alpha_i \mathcal{K}(\cdot, x_i), \tag{13.58}$$

for some vector of coefficients $\alpha \in \mathbb{R}^n$. Indeed, as argued in our proof of Proposition 12.33, any
function f in the Hilbert space can be written in the form $f = g + g_\perp$, where $g_\perp$ is a function
orthogonal to all functions of the form (13.58). Thus, we must have
$g_\perp(x_i) = \langle g_\perp, \mathcal{K}(\cdot, x_i) \rangle_{\mathbb{H}} = 0$, so that neither the objective nor the constraint
$\|f\|_n \le \delta$ has any dependence on $g_\perp$. Lastly, by the Pythagorean theorem, we have
$\|f\|_{\mathbb{H}}^2 = \|g\|_{\mathbb{H}}^2 + \|g_\perp\|_{\mathbb{H}}^2$, so that we may assume without loss of generality that $g_\perp = 0$.

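In terms of the coefficient vector $\alpha \in \mathbb{R}^n$ and kernel matrix K, the constraint $\|g\|_n \le \delta$ is
equivalent to $\|K\alpha\|_2 \le \delta$, whereas the inequality $\|g\|_{\mathbb{H}}^2 \le 1$ corresponds to
$\|g\|_{\mathbb{H}}^2 = \alpha^T K \alpha \le 1$. Thus, we can write the local Gaussian complexity as an optimization
problem in the vector $\alpha \in \mathbb{R}^n$ with a linear cost function and quadratic constraints, namely

$$G_n(\delta) = \frac{1}{\sqrt{n}} \, \mathbb{E}_w\Big[\sup_{\substack{\alpha^T K \alpha \le 1 \\ \alpha^T K^2 \alpha \le \delta^2}} \big|w^T K \alpha\big|\Big].$$

Since the kernel matrix K is symmetric and positive semidefinite, it has an eigendecomposition⁴
of the form $K = U^T \Lambda U$, where U is orthogonal and $\Lambda$ is diagonal with entries
$\hat{\mu}_1 \ge \hat{\mu}_2 \ge \cdots \ge \hat{\mu}_n > 0$. If we then define the transformed vector $\beta = K\alpha$, we find
(following some algebra) that the complexity can be written as

$$G_n(\delta) = \frac{1}{\sqrt{n}} \, \mathbb{E}_w\Big[\sup_{\beta \in \mathcal{D}} |w^T \beta|\Big], \quad \text{where} \quad \mathcal{D} := \Big\{\beta \in \mathbb{R}^n \;\Big|\; \|\beta\|_2^2 \le \delta^2, \; \sum_{j=1}^n \frac{\beta_j^2}{\hat{\mu}_j} \le 1\Big\}$$

is the intersection of two ellipses. Now define the ellipse

$$\mathcal{E} := \Big\{\beta \in \mathbb{R}^n \;\Big|\; \sum_{j=1}^n \eta_j \beta_j^2 \le 2\Big\}, \quad \text{where } \eta_j = \max\{\delta^{-2}, \hat{\mu}_j^{-1}\}.$$

⁴ In this argument, so as to avoid potential division by zero, we assume that K has strictly positive eigenvalues;
otherwise, we can simply repeat the argument given here while restricting the relevant summations to positive
eigenvalues.

We claim that $\mathcal{D} \subset \mathcal{E}$; indeed, for any $\beta \in \mathcal{D}$, we have

$$\sum_{j=1}^n \max\{\delta^{-2}, \hat{\mu}_j^{-1}\} \beta_j^2 \le \sum_{j=1}^n \frac{\beta_j^2}{\delta^2} + \sum_{j=1}^n \frac{\beta_j^2}{\hat{\mu}_j} \le 2.$$

Applying Hölder's inequality with the norm induced by $\mathcal{E}$ and its dual, we find that

$$G_n(\delta) \le \frac{1}{\sqrt{n}} \, \mathbb{E}\Big[\sup_{\beta \in \mathcal{E}} |\langle w, \beta \rangle|\Big] \le \sqrt{\frac{2}{n}} \, \mathbb{E}\sqrt{\sum_{j=1}^n \frac{w_j^2}{\eta_j}}.$$

Jensen's inequality allows us to move the expectation inside the square root, so that

$$G_n(\delta) \le \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^n \frac{\mathbb{E}[w_j^2]}{\eta_j}},$$

and substituting $(\eta_j)^{-1} = (\max\{\delta^{-2}, \hat{\mu}_j^{-1}\})^{-1} = \min\{\delta^2, \hat{\mu}_j\}$ yields the claim.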
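
The last two steps of this proof can be checked by simulation. The supremum of $\langle w, \beta \rangle$ over the ellipse $\mathcal{E}$ has the closed form $\sqrt{2} \sqrt{\sum_j w_j^2/\eta_j}$, so a Monte Carlo average of this quantity can be compared with the Jensen upper bound. The sketch below assumes an illustrative spectrum $\hat{\mu}_j = j^{-2}$ with n = 200 and $\delta = 0.1$; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 200, 0.1
mu_hat = 1.0 / np.arange(1, n + 1) ** 2        # assumed spectrum mu_j = j^(-2)
inv_eta = np.minimum(delta**2, mu_hat)         # 1/eta_j = min{delta^2, mu_j}

# Closed form of sup over the ellipse E of <w, beta>, for many draws of w.
w = rng.standard_normal((20000, n))
sup_over_E = np.sqrt(2.0) * np.sqrt((w**2 * inv_eta).sum(axis=1))

mc_estimate = sup_over_E.mean() / np.sqrt(n)                 # (1/sqrt(n)) E[sup]
jensen_bound = np.sqrt(2.0 / n) * np.sqrt(inv_eta.sum())     # right side of (13.57)
print(mc_estimate, jensen_bound)

# Sanity check of the closed form for one draw: the maximizer lies on the boundary of E.
w0 = w[0]
beta_star = np.sqrt(2.0) * (w0 * inv_eta) / np.sqrt(np.sum(w0**2 * inv_eta))
```

Since Jensen's inequality is the only lossy step being simulated, the Monte Carlo estimate should sit slightly below the bound.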


13.4.4 Proof of Theorem 13.17 


Finally, we turn to the proof of our general theorem on regularized M-estimators. By rescaling
the observation model by R, we can analyze an equivalent model with noise variance
$(\frac{\sigma}{R})^2$, and with the rescaled approximation error $\inf_{\|f\|_{\mathcal{F}} \le 1} \|f - f^*\|_n^2$. Our final
mean-squared error then should be multiplied by $R^2$ so as to obtain a result for the original problem.

In order to keep the notation streamlined, we introduce the shorthand $\tilde{\sigma} = \sigma/R$. Let $\tilde{f}$ be
any element of $\mathcal{F}$ such that $\|\tilde{f}\|_{\mathcal{F}} \le 1$. At the end of the proof, we optimize this choice.


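Since $\hat{f}$ and $\tilde{f}$ are optimal and feasible (respectively) for the program (13.52), we have

$$\frac{1}{2n} \sum_{i=1}^n \big(y_i - \hat{f}(x_i)\big)^2 + \lambda_n \|\hat{f}\|_{\mathcal{F}}^2 \le \frac{1}{2n} \sum_{i=1}^n \big(y_i - \tilde{f}(x_i)\big)^2 + \lambda_n \|\tilde{f}\|_{\mathcal{F}}^2.$$

Defining the errors $\hat{\Delta} = \hat{f} - f^*$ and $\tilde{\Delta} = \hat{f} - \tilde{f}$ and recalling that
$y_i = f^*(x_i) + \tilde{\sigma} w_i$, performing some algebra yields the modified basic inequality

$$\frac{1}{2} \|\hat{\Delta}\|_n^2 \le \frac{1}{2} \|\tilde{f} - f^*\|_n^2 + \Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i \tilde{\Delta}(x_i)\Big| + \lambda_n \big\{\|\tilde{f}\|_{\mathcal{F}}^2 - \|\hat{f}\|_{\mathcal{F}}^2\big\}, \tag{13.59}$$

where $w_i \sim N(0, 1)$ are i.i.d. Gaussian variables.
Since $\|\tilde{f}\|_{\mathcal{F}} \le 1$ by assumption, we certainly have the possibly weaker bound

$$\frac{1}{2} \|\hat{\Delta}\|_n^2 \le \frac{1}{2} \|\tilde{f} - f^*\|_n^2 + \Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i \tilde{\Delta}(x_i)\Big| + \lambda_n. \tag{13.60}$$

Consequently, if $\|\tilde{\Delta}\|_n \le \sqrt{t\delta_n}$, we can then follow the same argument as in the proof of
Theorem 13.13, thereby establishing the bound (along with the extra term $\lambda_n$ from our modified
basic inequality).

Otherwise, we may assume that $\|\tilde{\Delta}\|_n > \sqrt{t\delta_n}$, and we do so throughout the remainder of
the proof. We now split the argument into two cases.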


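Case 1: First, suppose that $\|\hat{f}\|_{\mathcal{F}} \le 2$. The bound $\|\tilde{f}\|_{\mathcal{F}} \le 1$ together with the inequality
$\|\hat{f}\|_{\mathcal{F}} \le 2$ implies that $\|\tilde{\Delta}\|_{\mathcal{F}} \le 3$. Consequently, by applying Lemma 13.12 over the set of
functions $\{g \in \partial\mathcal{F} \mid \|g\|_{\mathcal{F}} \le 3\}$, we conclude that

$$\Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i \tilde{\Delta}(x_i)\Big| \le c_0 \sqrt{t\delta_n}\, \|\tilde{\Delta}\|_n \qquad \text{with probability at least } 1 - e^{-\frac{n t \delta_n}{2\tilde{\sigma}^2}}.$$

By the triangle inequality, we have

$$\begin{aligned}
2\sqrt{t\delta_n}\, \|\tilde{\Delta}\|_n &\le 2\sqrt{t\delta_n}\, \|\hat{\Delta}\|_n + 2\sqrt{t\delta_n}\, \|\tilde{f} - f^*\|_n \\
&\le 2\sqrt{t\delta_n}\, \|\hat{\Delta}\|_n + 2t\delta_n + \frac{\|\tilde{f} - f^*\|_n^2}{2},
\end{aligned} \tag{13.61}$$

where the second step uses the Fenchel-Young inequality. Substituting these upper bounds
into the basic inequality (13.60), we find that

$$\frac{1}{2} \|\hat{\Delta}\|_n^2 \le \frac{1}{2}(1 + c_0) \|\tilde{f} - f^*\|_n^2 + 2c_0 t\delta_n + 2c_0 \sqrt{t\delta_n}\, \|\hat{\Delta}\|_n + \lambda_n,$$

so that the claim follows by the quadratic formula, modulo different values of the numerical
constants.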


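Case 2: Otherwise, we may assume that $\|\hat{f}\|_{\mathcal{F}} > 2 > 1 \ge \|\tilde{f}\|_{\mathcal{F}}$. In this case, we have

$$\|\tilde{f}\|_{\mathcal{F}}^2 - \|\hat{f}\|_{\mathcal{F}}^2 = \underbrace{\big\{\|\tilde{f}\|_{\mathcal{F}} + \|\hat{f}\|_{\mathcal{F}}\big\}}_{> 1} \underbrace{\big\{\|\tilde{f}\|_{\mathcal{F}} - \|\hat{f}\|_{\mathcal{F}}\big\}}_{< 0} \le \big\{\|\tilde{f}\|_{\mathcal{F}} - \|\hat{f}\|_{\mathcal{F}}\big\}.$$

Writing $\hat{f} = \tilde{f} + \tilde{\Delta}$ and noting that $\|\hat{f}\|_{\mathcal{F}} \ge \|\tilde{\Delta}\|_{\mathcal{F}} - \|\tilde{f}\|_{\mathcal{F}}$ by the triangle
inequality, we obtain

$$\begin{aligned}
\lambda_n \big\{\|\tilde{f}\|_{\mathcal{F}}^2 - \|\hat{f}\|_{\mathcal{F}}^2\big\} &\le \lambda_n \big\{\|\tilde{f}\|_{\mathcal{F}} - \|\hat{f}\|_{\mathcal{F}}\big\} \\
&\le \lambda_n \big\{2\|\tilde{f}\|_{\mathcal{F}} - \|\tilde{\Delta}\|_{\mathcal{F}}\big\} \\
&\le \lambda_n \big\{2 - \|\tilde{\Delta}\|_{\mathcal{F}}\big\},
\end{aligned}$$

where we again use the bound $\|\tilde{f}\|_{\mathcal{F}} \le 1$ in the final step.

Substituting this upper bound into our modified basic inequality (13.59) yields the upper
bound

$$\frac{1}{2} \|\hat{\Delta}\|_n^2 \le \frac{1}{2} \|\tilde{f} - f^*\|_n^2 + \Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i \tilde{\Delta}(x_i)\Big| + 2\lambda_n - \lambda_n \|\tilde{\Delta}\|_{\mathcal{F}}. \tag{13.62}$$

Our next step is to upper bound the stochastic component in the inequality (13.62).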


Lemma 13.23 There are universal positive constants $(c_1, c_2)$ such that, with probability
greater than $1 - c_1 e^{-\frac{n\delta_n^2}{c_2 \tilde{\sigma}^2}}$, we have

$$\Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i \Delta(x_i)\Big| \le 2\delta_n \|\Delta\|_n + 2\delta_n^2 \|\Delta\|_{\mathcal{F}} + \frac{1}{16} \|\Delta\|_n^2, \tag{13.63}$$

a bound that holds uniformly for all $\Delta \in \partial\mathcal{F}$ with $\|\Delta\|_{\mathcal{F}} \ge 1$.


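We now complete the proof of the theorem using this lemma. We begin by observing that,
since $\|\tilde{f}\|_{\mathcal{F}} \le 1$ and $\|\hat{f}\|_{\mathcal{F}} > 2$, the triangle inequality implies that
$\|\tilde{\Delta}\|_{\mathcal{F}} \ge \|\hat{f}\|_{\mathcal{F}} - \|\tilde{f}\|_{\mathcal{F}} > 1$, so that Lemma 13.23 may be applied. Substituting the
upper bound (13.63) into the inequality (13.62) yields

$$\begin{aligned}
\frac{1}{2} \|\hat{\Delta}\|_n^2 &\le \frac{1}{2} \|\tilde{f} - f^*\|_n^2 + 2\delta_n \|\tilde{\Delta}\|_n + \big\{2\delta_n^2 - \lambda_n\big\} \|\tilde{\Delta}\|_{\mathcal{F}} + 2\lambda_n + \frac{\|\tilde{\Delta}\|_n^2}{16} \\
&\le \frac{1}{2} \|\tilde{f} - f^*\|_n^2 + 2\delta_n \|\tilde{\Delta}\|_n + 2\lambda_n + \frac{\|\tilde{\Delta}\|_n^2}{16},
\end{aligned} \tag{13.64}$$

where the second step uses the fact that $2\delta_n^2 - \lambda_n \le 0$ by assumption.
Our next step is to convert the terms involving $\tilde{\Delta}$ into quantities involving $\hat{\Delta}$: in particular,
by the triangle inequality, we have $\|\tilde{\Delta}\|_n \le \|\tilde{f} - f^*\|_n + \|\hat{\Delta}\|_n$. Thus, we have

$$2\delta_n \|\tilde{\Delta}\|_n \le 2\delta_n \|\tilde{f} - f^*\|_n + 2\delta_n \|\hat{\Delta}\|_n, \tag{13.65a}$$

and in addition, combined with the inequality $(a + b)^2 \le 2a^2 + 2b^2$, we find that

$$\frac{\|\tilde{\Delta}\|_n^2}{16} \le \frac{1}{8} \big\{\|\tilde{f} - f^*\|_n^2 + \|\hat{\Delta}\|_n^2\big\}. \tag{13.65b}$$

Substituting inequalities (13.65a) and (13.65b) into the earlier bound (13.64) and performing
some algebra yields

$$\Big\{\frac{1}{2} - \frac{1}{8}\Big\} \|\hat{\Delta}\|_n^2 \le \Big\{\frac{1}{2} + \frac{1}{8}\Big\} \|\tilde{f} - f^*\|_n^2 + 2\delta_n \|\tilde{f} - f^*\|_n + 2\delta_n \|\hat{\Delta}\|_n + 2\lambda_n.$$

The claim (13.55a) follows by applying the quadratic formula to this inequality.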


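It remains to prove Lemma 13.23. We claim that it suffices to prove the bound (13.63) for
functions $g \in \partial\mathcal{F}$ such that $\|g\|_{\mathcal{F}} = 1$. Indeed, suppose that it holds for all such functions,
and that we are given a function $\Delta$ with $\|\Delta\|_{\mathcal{F}} > 1$. By assumption, we can apply the
inequality (13.63) to the new function $g := \Delta/\|\Delta\|_{\mathcal{F}}$, which belongs to $\partial\mathcal{F}$ by the star-shaped
assumption. Applying the bound (13.63) to g and then multiplying both sides by $\|\Delta\|_{\mathcal{F}}$, we
obtain

$$\begin{aligned}
\Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i \Delta(x_i)\Big| &\le 2\delta_n \|\Delta\|_n + 2\delta_n^2 \|\Delta\|_{\mathcal{F}} + \frac{1}{16} \frac{\|\Delta\|_n^2}{\|\Delta\|_{\mathcal{F}}} \\
&\le 2\delta_n \|\Delta\|_n + 2\delta_n^2 \|\Delta\|_{\mathcal{F}} + \frac{1}{16} \|\Delta\|_n^2,
\end{aligned}$$

where the second inequality uses the fact that $\|\Delta\|_{\mathcal{F}} > 1$ by assumption.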
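In order to establish the bound (13.63) for functions with $\|g\|_{\mathcal{F}} = 1$, we first consider it
over the ball $\{\|g\|_n \le t\}$, for some fixed radius t > 0. Define the random variable

$$Z_n(t) := \sup_{\substack{\|g\|_{\mathcal{F}} \le 1 \\ \|g\|_n \le t}} \Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i g(x_i)\Big|.$$

Viewed as a function of the standard Gaussian vector w, it is Lipschitz with parameter at
most $\tilde{\sigma} t/\sqrt{n}$. Consequently, Theorem 2.26 implies that

$$\mathbb{P}\big[Z_n(t) \ge \mathbb{E}[Z_n(t)] + u\big] \le e^{-\frac{n u^2}{2\tilde{\sigma}^2 t^2}}. \tag{13.66}$$

We first derive a bound for $t = \delta_n$. By the definitions of $G_n$ and the critical radius, we have
$\mathbb{E}[Z_n(\delta_n)] \le \tilde{\sigma} G_n(\delta_n) \le \delta_n^2$. Setting $u = \delta_n^2$ in the tail bound (13.66), we find that

$$\mathbb{P}\big[Z_n(\delta_n) \ge 2\delta_n^2\big] \le e^{-\frac{n\delta_n^2}{2\tilde{\sigma}^2}}. \tag{13.67a}$$

On the other hand, for any $t > \delta_n$, we have

$$\mathbb{E}[Z_n(t)] = \tilde{\sigma} G_n(t) = t \, \frac{\tilde{\sigma} G_n(t)}{t} \overset{(i)}{\le} t \, \frac{\tilde{\sigma} G_n(\delta_n)}{\delta_n} \overset{(ii)}{\le} t\delta_n,$$

where inequality (i) follows from Lemma 13.6, and inequality (ii) follows by our choice of
$\delta_n$. Using this upper bound on the mean and setting $u = t^2/32$ in the tail bound (13.66) yields

$$\mathbb{P}\Big[Z_n(t) \ge t\delta_n + \frac{t^2}{32}\Big] \le e^{-c_2 \frac{n t^2}{\tilde{\sigma}^2}} \qquad \text{for each } t > \delta_n. \tag{13.67b}$$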


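We are now equipped to complete the proof by a "peeling" argument. Let $\mathcal{E}$ denote the
event that the bound (13.63) is violated for some function $g \in \partial\mathcal{F}$ with $\|g\|_{\mathcal{F}} = 1$. For real
numbers $0 \le a < b$, let $\mathcal{E}(a, b)$ denote the event that it is violated for some function such
that $\|g\|_n \in [a, b]$ and $\|g\|_{\mathcal{F}} = 1$. For m = 0, 1, 2, ..., define $t_m = 2^m \delta_n$. We then have the
decomposition $\mathcal{E} = \mathcal{E}(0, t_0) \cup \big(\bigcup_{m=0}^\infty \mathcal{E}(t_m, t_{m+1})\big)$ and hence, by the union bound,

$$\mathbb{P}[\mathcal{E}] \le \mathbb{P}[\mathcal{E}(0, t_0)] + \sum_{m=0}^\infty \mathbb{P}[\mathcal{E}(t_m, t_{m+1})]. \tag{13.68}$$

The final step is to bound each of the terms in this summation. Since $t_0 = \delta_n$, we have

$$\mathbb{P}[\mathcal{E}(0, t_0)] \le \mathbb{P}\big[Z_n(\delta_n) \ge 2\delta_n^2\big] \le e^{-\frac{n\delta_n^2}{2\tilde{\sigma}^2}}, \tag{13.69}$$

using our earlier tail bound (13.67a). On the other hand, suppose that $\mathcal{E}(t_m, t_{m+1})$ holds,
meaning that there exists some function g with $\|g\|_{\mathcal{F}} = 1$ and $\|g\|_n \in [t_m, t_{m+1}]$ such that

$$\begin{aligned}
\Big|\frac{\tilde{\sigma}}{n} \sum_{i=1}^n w_i g(x_i)\Big| &\ge 2\delta_n \|g\|_n + 2\delta_n^2 + \frac{1}{16} \|g\|_n^2 \\
&\overset{(i)}{\ge} 2\delta_n t_m + 2\delta_n^2 + \frac{1}{16} t_m^2 \\
&\overset{(ii)}{=} \delta_n t_{m+1} + 2\delta_n^2 + \frac{1}{64} t_{m+1}^2,
\end{aligned}$$

where step (i) follows since $\|g\|_n \ge t_m$, and step (ii) follows since $t_{m+1} = 2t_m$. This lower
bound implies that $Z_n(t_{m+1}) \ge \delta_n t_{m+1} + \frac{t_{m+1}^2}{64}$. Applying the tail bound (13.66) with
$u = t_{m+1}^2/64$, exactly as in the derivation of (13.67b) and at the cost of a different universal
constant, yields

$$\mathbb{P}[\mathcal{E}(t_m, t_{m+1})] \le e^{-c_2 \frac{n t_{m+1}^2}{\tilde{\sigma}^2}} = e^{-c_2 \frac{n 2^{2m+2} \delta_n^2}{\tilde{\sigma}^2}}.$$

Substituting this inequality and our earlier bound (13.69) into equation (13.68) yields

$$\mathbb{P}[\mathcal{E}] \le e^{-\frac{n\delta_n^2}{2\tilde{\sigma}^2}} + \sum_{m=0}^\infty e^{-c_2 \frac{n 2^{2m+2} \delta_n^2}{\tilde{\sigma}^2}} \le c_1 e^{-c_2 \frac{n\delta_n^2}{\tilde{\sigma}^2}},$$

where the reader should recall that the precise values of universal constants may change
from line to line.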

13.5 Bibliographic details and background 


Nonparametric regression is a classical problem in statistics with a lengthy and rich history. 
Although this chapter is limited to the method of nonparametric least squares, there are a 
variety of other cost functions that can be used for regression, which might be preferable for 
reasons of robustness. The techniques described in this chapter are relevant for analyzing any
such M-estimator—that is, any method based on minimizing or maximizing some criterion 
of fit. In addition, nonparametric regression can be tackled via methods that are not most 
naturally viewed as M-estimators, including orthogonal function expansions, local poly- 
nomial representations, kernel density estimators, nearest-neighbor methods and scatterplot 
smoothing methods, among others. We refer the reader to the books (Györfi et al., 2002;
Härdle et al., 2004; Wasserman, 2006; Eggermont and LaRiccia, 2007; Tsybakov, 2009)
and references therein for further background on these and other methods. 

An extremely important idea in this chapter was the use of localized forms of Gaussian 
or Rademacher complexity, as opposed to the global forms studied in Chapter 4. These lo- 
calized complexity measures are needed in order to obtain optimal rates for nonparametric 
estimation problems. The idea of localization plays an important role in empirical process 
theory, and we embark on a more in-depth study of it in Chapter 14 to follow. Local function 
complexities of the form given in Corollary 13.7 are used extensively by van de Geer (2000), 
whereas other authors have studied localized forms of the Rademacher and Gaussian com- 
plexities (Koltchinskii, 2001, 2006; Bartlett et al., 2005). The bound on the localized Rade- 
macher complexity of reproducing kernel Hilbert spaces, as stated in Lemma 13.22, is due to 
Mendelson (2002); see also the paper by Bartlett and Mendelson (2002) for related results. 
The peeling technique used in the proof of Lemma 13.23 is widely used in empirical process 
theory (Alexander, 1987; van de Geer, 2000). 

The ridge regression estimator from Examples 13.1 and 13.8 was introduced by Hoerl and 
Kennard (1970). The Lasso estimator from Example 13.1 is treated in detail in Chapter 7. 
The cubic spline estimator from Example 13.2, as well as the kernel ridge regression estima- 
tor from Example 13.3, are standard methods; see Chapter 12 as well as the books (Wahba, 
1990; Gu, 2002) for more details. The $\ell_q$-ball constrained estimators from Examples 13.1
and 13.9 were analyzed by Raskutti et al. (2011), who also used information-theoretic methods,
to be discussed in Chapter 15, in order to derive matching lower bounds. The results
on metric entropies of q-convex hulls in this example are based on results from Carl and
Pajor (1988), as well as Guédon and Litvak (2000); see also the arguments given by Raskutti
et al. (2011) for details on the specific claims given here.

The problems of convex and/or monotonic regression from Example 13.4 are particular 
examples of what is known as shape-constrained estimation. It has been the focus of classi- 
cal work (Hildreth, 1954; Brunk, 1955, 1970; Hanson and Pledger, 1976), as well as much 
recent and on-going work (e.g., Balabdaoui et al., 2009; Cule et al., 2010; Dümbgen et al.,
2011; Seijo and Sen, 2011; Chatterjee et al., 2015), especially in the multivariate setting. 
The books (Rockafellar, 1970; Hiriart-Urruty and Lemaréchal, 1993; Borwein and Lewis, 
1999; Bertsekas, 2003; Boyd and Vandenberghe, 2004) contain further information on sub- 
gradients and other aspects of convex analysis. The bound (13.34) on the sup-norm ($L^\infty$)
metric entropy for bounded convex Lipschitz functions is due to Bronshtein (1976); see also 
Section 8.4 of Dudley (1999) for more details. On the other hand, the class of all convex 
functions f: [0,1] — [0,1] without any Lipschitz constraint is not totally bounded in the 
sup-norm metric; see Exercise 5.1 for details. Guntuboyina and Sen (2013) provide bounds 
on the entropy in the $L_p$-metrics over the range $p \in [1, \infty)$ for convex functions without the
Lipschitz condition. 

Stone (1985) introduced the class of additive nonparametric regression models discussed 
in Exercise 13.9, and subsequent work has explored many extensions and variants of these 
models (e.g., Hastie and Tibshirani, 1986; Buja et al., 1989; Meier et al., 2009; Ravikumar 
et al., 2009; Koltchinskii and Yuan, 2010; Raskutti et al., 2012). Exercise 13.9 in this chapter 
and Exercise 14.8 in Chapter 14 explore some properties of the standard additive model. 


13.6 Exercises 


Exercise 13.1 (Characterization of the Bayes least-squares estimate) 


(a) Given a random variable Z with finite second moment, show that the function
$G(t) = \mathbb{E}[(Z - t)^2]$ is minimized at $t = \mathbb{E}[Z]$.

(b) Assuming that all relevant expectations exist, show that the minimizer of the population 
mean-squared error (13.1) is given by the conditional expectation f*(x) = E[Y | X = x]. 
(Hint: The tower property and part (a) may be useful to you.) 


(c) Let f be any other function for which the mean-squared error Ey y[(Y — f(X))”] is finite. 
Show that the excess risk of f is given by ||f — f*||3, as in equation (13.4). 


Exercise 13.2 (Prediction error in linear regression) Recall the linear regression model 
from Example 13.8 with fixed design. Show via a direct argument that 

$$\mathbb{E}\big[\|f_{\hat{\theta}} - f_{\theta^*}\|_n^2\big] \le \sigma^2 \, \frac{\operatorname{rank}(X)}{n},$$

valid for any observation noise that is zero-mean with variance $\sigma^2$. 


Exercise 13.3 (Cubic smoothing splines) Recall the cubic spline estimate (13.10) from 
Example 13.2, as well as the kernel function $K(x, z) = \int_0^1 (x - y)_+ (z - y)_+ \, dy$ from 
Example 12.29. 

(a) Show that the optimal solution must take the form 

$$\hat{f}(x) = \hat{\theta}_0 + \hat{\theta}_1 x + \frac{1}{\sqrt{n}} \sum_{i=1}^n \hat{\alpha}_i K(x, x_i)$$

for some vectors $\hat{\theta} \in \mathbb{R}^2$ and $\hat{\alpha} \in \mathbb{R}^n$. 
(b) Show that these vectors can be obtained by solving the quadratic program 

$$(\hat{\theta}, \hat{\alpha}) = \arg\min_{(\theta, \alpha) \in \mathbb{R}^2 \times \mathbb{R}^n} \Big\{ \frac{1}{2n} \|y - X\theta - \sqrt{n}\, K \alpha\|_2^2 + \lambda_n \alpha^T K \alpha \Big\},$$

where $K \in \mathbb{R}^{n \times n}$ is the kernel matrix defined by the kernel function in part (a), and 
$X \in \mathbb{R}^{n \times 2}$ is a design matrix with $i$th row given by $[1 \;\; x_i]$. 


Exercise 13.4 (Star-shaped sets and convexity) In this exercise, we explore some properties 
of star-shaped sets. 


(a) Show that a set $C$ is star-shaped around one of its points $x^*$ if and only if the point 
$\alpha x + (1 - \alpha) x^*$ belongs to $C$ for any $x \in C$ and any $\alpha \in [0, 1]$. 
(b) Show that a set C is convex if and only if it is star-shaped around each one of its points. 


Exercise 13.5 (Lower bounds on the critical inequality) Consider the critical inequality 
(13.17) in the case $f^* = 0$, so that $\mathcal{F}^* = \mathcal{F}$. 

(a) Show that the critical inequality (13.17) is always satisfied for $\delta^2 = 4\sigma^2$. 

(b) Suppose that a convex function class $\mathcal{F}$ contains the constant function $f \equiv 1$. Show 
that any $\delta \in (0, 1]$ satisfying the critical inequality (13.17) must be lower bounded as 
$\delta^2 \ge \min\big\{1, \frac{8}{\pi} \frac{\sigma^2}{n}\big\}$. 


Exercise 13.6 (Local Gaussian complexity and adaptivity) This exercise illustrates how, 
even for a fixed base function class, the local Gaussian complexity $\mathcal{G}_n(\delta; \mathcal{F}^*)$ of the shifted 
function class can vary dramatically as the target function $f^*$ is changed. For each $\theta \in \mathbb{R}^n$, let 
$f_\theta(x) = \langle \theta, x \rangle$ be a linear function, and consider the class $\mathcal{F}_{\ell_1}(1) = \{f_\theta \mid \|\theta\|_1 \le 1\}$. Suppose 
that we observe samples of the form 

$$y_i = f_{\theta^*}(e_i) + \frac{\sigma}{\sqrt{n}} w_i = \theta^*_i + \frac{\sigma}{\sqrt{n}} w_i,$$

where $w_i \sim N(0, 1)$ is an i.i.d. noise sequence. Let us analyze the performance of the 
$\ell_1$-constrained least-squares estimator 


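$$\hat{\theta} = \arg\min_{f_\theta \in \mathcal{F}_{\ell_1}(1)} \Big\{ \frac{1}{n} \sum_{i=1}^n \big(y_i - f_\theta(e_i)\big)^2 \Big\} = \arg\min_{\|\theta\|_1 \le 1} \Big\{ \frac{1}{n} \|y - \theta\|_2^2 \Big\}.$$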


(a) For any fo € Fe (1), show that G,(6; Fi (1) LČ toe for some universal constant c1, 


mgn 


and hence that lio - OR < co with high probability. 
(b) Now consider some $f_{\theta^*}$ with $\theta^* \in \{e_1, \ldots, e_n\}$, that is, one of the canonical basis vectors. 
Show that there is a universal constant $c_2$ such that the local Gaussian complexity is 


bounded as G,(6; F} (1)) < 26 = , and hence that || — Ol < c eei with high 
probability. 

Exercise 13.7 (Rates for polynomial regression) Consider the class of all $(m - 1)$-degree 
polynomials 

$$\mathcal{P}_m = \{f_\theta\colon \mathbb{R} \to \mathbb{R} \mid \theta \in \mathbb{R}^m\}, \quad \text{where } f_\theta(x) = \sum_{j=0}^{m-1} \theta_j x^j,$$

and suppose that $f^* \in \mathcal{P}_m$. Show that there are universal positive constants $(c_0, c_1, c_2)$ such 
that the least-squares estimator satisfies 

$$\mathbb{P}\Big[\|\hat{f} - f^*\|_n^2 \ge c_0 \frac{\sigma^2 m \log n}{n}\Big] \le c_1 e^{-c_2 m \log n}.$$

Exercise 13.8 (Rates for twice-differentiable functions) Consider the function class $\mathcal{F}$ of 
functions $f\colon [0,1] \to \mathbb{R}$ that are twice differentiable with $\|f\|_\infty + \|f'\|_\infty + \|f''\|_\infty \le C$ for 
some constant $C < \infty$. Show that there are positive constants $(c_0, c_1, c_2)$, which may depend 
on $C$ but not on $(n, \sigma^2)$, such that the non-parametric least-squares estimate satisfies 

$$\mathbb{P}\Big[\|\hat{f} - f^*\|_n^2 \ge c_0 \Big(\frac{\sigma^2}{n}\Big)^{4/5}\Big] \le c_1 e^{-c_2 (n/\sigma^2)^{1/5}}.$$

(Hint: Results from Chapter 5 may be useful to you.) 


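Exercise 13.9 (Rates for additive nonparametric models) Given a convex and symmetric 
class $\mathscr{G}$ of univariate functions $g\colon \mathbb{R} \to \mathbb{R}$ equipped with a norm $\|\cdot\|_{\mathscr{G}}$, consider the class of 
additive functions over $\mathbb{R}^d$, namely 

$$\mathcal{F}_{\mathrm{add}} = \Big\{f\colon \mathbb{R}^d \to \mathbb{R} \;\Big|\; f = \sum_{j=1}^d g_j \text{ for some } g_j \in \mathscr{G} \text{ with } \|g_j\|_{\mathscr{G}} \le 1\Big\}. \tag{13.70}$$

Suppose that we have $n$ i.i.d. samples of the form $y_i = f^*(x_i) + \sigma w_i$, where each 
$x_i = (x_{i1}, \ldots, x_{id}) \in \mathbb{R}^d$, $w_i \sim N(0, 1)$, and $f^* := \sum_{j=1}^d g_j^*$ is some function in $\mathcal{F}_{\mathrm{add}}$, and that we 
estimate $f^*$ by the constrained least-squares estimate 

$$\hat{f} := \arg\min_{f \in \mathcal{F}_{\mathrm{add}}} \Big\{ \frac{1}{n} \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 \Big\}.$$

For each $j = 1, \ldots, d$, define the $j$th-coordinate Gaussian complexity 

$$\mathcal{G}_{n,j}(\delta; 2\mathscr{G}) = \mathbb{E}\Big[ \sup_{\|g_j\|_{\mathscr{G}} \le 2,\; \|g_j\|_n \le \delta} \Big| \frac{1}{n} \sum_{i=1}^n w_i g_j(x_{ij}) \Big| \Big],$$

and let $\delta_{n,j} > 0$ be the smallest positive solution to the inequality $\frac{\mathcal{G}_{n,j}(\delta; 2\mathscr{G})}{\delta} \le \frac{\delta}{2\sigma}$. 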


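(a) Defining $\delta_{n,\max} = \max_{j=1,\ldots,d} \delta_{n,j}$, show that, for each $t \ge \delta_{n,\max}$, we have 

$$\frac{\sigma}{n} \Big| \sum_{i=1}^n w_i \hat{\Delta}(x_i) \Big| \le d\, t\, \delta_{n,\max} + 2 \sqrt{t\, \delta_{n,\max}} \Big( \sum_{j=1}^d \|\hat{\Delta}_j\|_n \Big)$$

with probability at least $1 - c_1 d\, e^{-c_2 n t \delta_{n,\max}}$. (Note that $\hat{f} = \sum_{j=1}^d \hat{g}_j$ for some $\hat{g}_j \in \mathscr{G}$, so 
that the function $\hat{\Delta}_j = \hat{g}_j - g_j^*$ corresponds to the error in coordinate $j$, and $\hat{\Delta} := \sum_{j=1}^d \hat{\Delta}_j$ 
is the full error function.) 
(b) Suppose that there is a universal constant $K \ge 1$ such that 

$$\sqrt{\sum_{j=1}^d \|g_j\|_n^2} \le \sqrt{K}\, \Big\| \sum_{j=1}^d g_j \Big\|_n \quad \text{for all } g_j \in \mathscr{G}.$$

Use this bound and part (a) to show that $\|\hat{f} - f^*\|_n^2 \le c_3 K d\, \delta_{n,\max}^2$ with high probability. 
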
Exercise 13.10 (Orthogonal series expansions) Recall the function class $\mathcal{F}_{\mathrm{ortho}}(1; T)$ from 
Example 13.14 defined by orthogonal series expansion with $T$ coefficients. 

(a) Given a set of design points $\{x_1, \ldots, x_n\}$, define the $n \times T$ matrix $\Phi \equiv \Phi(x_1^n)$ with 
$(i, j)$th entry $\Phi_{ij} = \phi_j(x_i)$. Show that the nonparametric least-squares estimate $\hat{f}$ over 
$\mathcal{F}_{\mathrm{ortho}}(1; T)$ can be obtained by solving the ridge regression problem 

$$\min_{\theta \in \mathbb{R}^T} \Big\{ \frac{1}{n} \|y - \Phi \theta\|_2^2 + \lambda_n \|\theta\|_2^2 \Big\}$$

for a suitable choice of regularization parameter $\lambda_n \ge 0$. 
(b) Show that $\inf_{f \in \mathcal{F}_{\mathrm{ortho}}(1; T)} \|f - f^*\|_2^2 = \sum_{j=T+1}^{\infty} (\theta_j^*)^2$. 

Exercise 13.11 (Differentiable functions and Fourier coefficients) For a given integer $\alpha \ge 1$ 
and radius $R > 0$, consider the class of functions $\mathcal{F}_\alpha(R) \subset L^2[0, 1]$ such that: 

• The function $f$ is $\alpha$-times differentiable, with $\int_0^1 (f^{(\alpha)}(x))^2 \, dx \le R$. 
• It and its derivatives satisfy the boundary conditions $f^{(j)}(0) = f^{(j)}(1) = 0$ for all 
$j = 0, 1, \ldots, \alpha$. 

(a) For a function $f \in \mathcal{F}_\alpha(R) \cap \{\|f\|_2 \le 1\}$, let $\{\beta_0, (\beta_m, \tilde{\beta}_m)_{m=1}^\infty\}$ be its Fourier coefficients as 
previously defined in Example 13.15. Show that there is a constant $c$ such that 
$\beta_m^2 + \tilde{\beta}_m^2 \le \frac{cR}{m^{2\alpha}}$ for all $m \ge 1$. 


(b) Verify the approximation-theoretic guarantee (13.47). 


14 


Localization and uniform laws 


As discussed previously in Chapter 4, uniform laws of large numbers concern the deviations 
between sample and population averages, when measured in a uniform sense over a given 
function class. The classical forms of uniform laws are asymptotic in nature, guaranteeing 
that the deviations converge to zero in probability or almost surely. The more modern ap- 
proach is to provide non-asymptotic guarantees that hold for all sample sizes, and provide 
sharp rates of convergence. In order to achieve the latter goal, an important step is to localize 
the deviations to a small neighborhood of the origin. We have already encountered a form of 
localization in our discussion of nonparametric regression from Chapter 13. In this chapter, 
we turn to a more in-depth study of this technique and its use in establishing sharp uniform 
laws for various types of processes. 


14.1 Population and empirical $L^2$-norms 


We begin our exploration with a detailed study of the relation between the population and 
empirical $L^2$-norms. Given a function $f\colon \mathcal{X} \to \mathbb{R}$ and a probability distribution $\mathbb{P}$ over $\mathcal{X}$, 
the usual $L^2(\mathbb{P})$-norm is given by 

$$\|f\|_{L^2(\mathbb{P})}^2 := \int_{\mathcal{X}} f^2(x) \, \mathbb{P}(dx) = \mathbb{E}[f^2(X)], \tag{14.1}$$

and we say that $f \in L^2(\mathbb{P})$ whenever this norm is finite. When the probability distribution $\mathbb{P}$ 
is clear from the context, we adopt $\|f\|_2$ as a convenient shorthand for $\|f\|_{L^2(\mathbb{P})}$. 

Given a set of $n$ samples $\{x_i\}_{i=1}^n := \{x_1, x_2, \ldots, x_n\}$, each drawn i.i.d. according to $\mathbb{P}$, 
consider the empirical distribution 

$$\mathbb{P}_n(x) := \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x)$$

that places mass $1/n$ at each sample. It induces the empirical $L^2$-norm 

$$\|f\|_{L^2(\mathbb{P}_n)}^2 := \frac{1}{n} \sum_{i=1}^n f^2(x_i) = \int_{\mathcal{X}} f^2(x) \, \mathbb{P}_n(dx). \tag{14.2}$$

Again, to lighten notation, when the underlying empirical distribution $\mathbb{P}_n$ is clear from 
context, we adopt the convenient shorthand $\|f\|_n$ for $\|f\|_{L^2(\mathbb{P}_n)}$. 


In our analysis of nonparametric least squares from Chapter 13, we provided bounds on 
the $L^2(\mathbb{P}_n)$-error in which the samples $\{x_i\}_{i=1}^n$ were viewed as fixed. By contrast, throughout 
this chapter, we view the samples as being random variables, so that the empirical norm is 
itself a random variable. Since each $x_i \sim \mathbb{P}$, the linearity of expectation guarantees that 

$$\mathbb{E}\big[\|f\|_n^2\big] = \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^n f^2(x_i)\Big] = \|f\|_2^2 \quad \text{for any function } f \in L^2(\mathbb{P}).$$


Consequently, under relatively mild conditions on the random variable $f(x)$, the law of 
large numbers implies that $\|f\|_n^2$ converges to $\|f\|_2^2$. Such a limit theorem has its usual 
non-asymptotic analogs: for instance, if the function $f$ is uniformly bounded, that is, if 

$$\|f\|_\infty := \sup_{x \in \mathcal{X}} |f(x)| \le b \quad \text{for some } b < \infty,$$

then Hoeffding's inequality (cf. Proposition 2.5 and equation (2.11)) implies that 

$$\mathbb{P}\Big[\big| \|f\|_n^2 - \|f\|_2^2 \big| \ge t\Big] \le 2 e^{-\frac{n t^2}{2 b^4}}.$$
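
The single-function tail bound can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $f(x) = x$ with $x$ uniform on $[-1, 1]$ (so $b = 1$ and $\|f\|_2^2 = 1/3$), and it uses the Hoeffding form $2\exp(-nt^2/(2b^4))$, which is one valid instantiation for a variable $f^2(x) \in [0, b^2]$. The function names are hypothetical.

```python
import math
import random


def empirical_sq_norm(f, xs):
    """Return ||f||_n^2 = (1/n) * sum_i f(x_i)^2."""
    return sum(f(x) ** 2 for x in xs) / len(xs)


def hoeffding_bound(n, t, b):
    """Tail bound 2 * exp(-n t^2 / (2 b^4)), capped at 1, for a b-bounded f."""
    return min(1.0, 2.0 * math.exp(-n * t ** 2 / (2.0 * b ** 4)))


def deviation_frequency(f, pop_sq_norm, n, t, trials, rng):
    """Fraction of trials in which | ||f||_n^2 - ||f||_2^2 | >= t."""
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(empirical_sq_norm(f, xs) - pop_sq_norm) >= t:
            hits += 1
    return hits / trials


# Assumed setup: f(x) = x, x ~ Uniform[-1, 1], hence b = 1 and E[f^2] = 1/3.
rng = random.Random(0)
n, t, b = 200, 0.1, 1.0
freq = deviation_frequency(lambda x: x, 1.0 / 3.0, n, t, trials=2000, rng=rng)
bound = hoeffding_bound(n, t, b)
print(f"observed frequency = {freq:.4f}, Hoeffding bound = {bound:.4f}")
```

The observed frequency of large deviations should sit below the bound; how loose the bound is depends on the variance of $f^2(x)$, which Hoeffding's inequality ignores.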


As in Chapter 4, our interest is in extending this type of tail bound—valid for a single 
function f—to a result that applies uniformly to all functions in a certain function class 
F. Our analysis in this chapter, however, will be more refined: by using localized forms of 
complexity, we obtain optimal bounds. 


14.1.1 A uniform law with localization 


We begin by stating a theorem that controls the deviations in the random variable 
$\big| \|f\|_n - \|f\|_2 \big|$, when measured in a uniform sense over a function class $\mathcal{F}$. We then 
illustrate some consequences of this result in application to nonparametric regression. 

As with our earlier results on nonparametric least squares from Chapter 13, our result is 
stated in terms of a localized form of Rademacher complexity. For the current purposes, it 
is convenient to define the complexity at the population level. For a given radius $\delta > 0$ and 
function class $\mathcal{F}$, consider the localized population Rademacher complexity 

$$\overline{\mathcal{R}}_n(\delta; \mathcal{F}) = \mathbb{E}_{\varepsilon, x}\Big[ \sup_{f \in \mathcal{F},\; \|f\|_2 \le \delta} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big| \Big], \tag{14.3}$$

where $\{x_i\}_{i=1}^n$ are i.i.d. samples from some underlying distribution $\mathbb{P}$, and $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. 
Rademacher variables taking values in $\{-1, +1\}$ equiprobably, independent of the sequence $\{x_i\}_{i=1}^n$. 

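In the following result, we assume that $\mathcal{F}$ is star-shaped around the origin, meaning that, 
for any $f \in \mathcal{F}$ and scalar $\alpha \in [0,1]$, the function $\alpha f$ also belongs to $\mathcal{F}$. In addition, we 
require the function class to be $b$-uniformly bounded, meaning that there is a constant $b < \infty$ 
such that $\|f\|_\infty \le b$ for all $f \in \mathcal{F}$. 

Theorem 14.1 Given a star-shaped and $b$-uniformly bounded function class $\mathcal{F}$, let 
$\delta_n$ be any positive solution of the inequality 

$$\overline{\mathcal{R}}_n(\delta; \mathcal{F}) \le \frac{\delta^2}{b}. \tag{14.4}$$

Then for any $t \ge \delta_n$, we have 

$$\big| \|f\|_n^2 - \|f\|_2^2 \big| \le \frac{1}{2} \|f\|_2^2 + \frac{t^2}{2} \quad \text{for all } f \in \mathcal{F} \tag{14.5a}$$

with probability at least $1 - c_1 e^{-c_2 \frac{n t^2}{b^2}}$. If in addition $n \delta_n^2 \ge \frac{2}{c_2} \log(4 \log(1/\delta_n))$, then 

$$\big| \|f\|_n - \|f\|_2 \big| \le c_0 \delta_n \quad \text{for all } f \in \mathcal{F} \tag{14.5b}$$

with probability at least $1 - c_1' e^{-c_2' \frac{n \delta_n^2}{b^2}}$. 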


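It is worth noting that a similar result holds in terms of the localized empirical Rademacher 
complexity, namely the data-dependent quantity 

$$\widehat{\mathcal{R}}_n(\delta) \equiv \widehat{\mathcal{R}}_n(\delta; \mathcal{F}) := \mathbb{E}_\varepsilon\Big[ \sup_{f \in \mathcal{F},\; \|f\|_n \le \delta} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big| \Big], \tag{14.6}$$

and any positive solution $\hat{\delta}_n$ to the inequality 

$$\widehat{\mathcal{R}}_n(\delta) \le \frac{\delta^2}{b}. \tag{14.7}$$

Since the Rademacher complexity $\widehat{\mathcal{R}}_n$ depends on the data, this critical radius $\hat{\delta}_n$ is a random 
quantity, but it is closely related to the deterministic radius $\delta_n$ defined in terms of the 
population Rademacher complexity (14.3). More precisely, let $\delta_n$ and $\hat{\delta}_n$ denote the smallest 
positive solutions to inequalities (14.4) and (14.7), respectively. Then there are universal 
constants $c < 1 < C$ such that, with probability at least $1 - c_1 e^{-c_2 \frac{n \delta_n^2}{b^2}}$, we are guaranteed that 
$\hat{\delta}_n \in [c\, \delta_n, C \delta_n]$, and hence 

$$\big| \|f\|_n - \|f\|_2 \big| \le \frac{c_0}{c}\, \hat{\delta}_n \quad \text{for all } f \in \mathcal{F}. \tag{14.8}$$

See Proposition 14.25 in the Appendix (Section 14.5) for the details and proof. 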


Theorem 14.1 is best understood by considering some concrete examples. 


Example 14.2 (Bounds for quadratic functions) For a given coefficient vector $\theta \in \mathbb{R}^3$, 
define the quadratic function $f_\theta(x) := \theta_0 + \theta_1 x + \theta_2 x^2$, and let us consider the set of all 
bounded quadratic functions over the unit interval $[-1, 1]$, that is, the function class 

$$\mathcal{P}_2 := \big\{f_\theta \text{ for some } \theta \in \mathbb{R}^3 \text{ such that } \max_{x \in [-1,1]} |f_\theta(x)| \le 1\big\}. \tag{14.9}$$


Suppose that we are interested in relating the population and empirical $L^2$-norms uniformly 
over this family, when the samples are drawn from the uniform distribution over $[-1, 1]$. 

We begin by exploring a naive approach, one that ignores localization and hence leads 
to a sub-optimal rate. From our results on VC dimension in Chapter 4 (in particular, see 
Proposition 4.20), it is straightforward to see that $\mathcal{P}_2$ has VC dimension at most 3. In 
conjunction with the boundedness of the function class, Lemma 4.14 guarantees that for any 
$\delta > 0$, we have 

$$\mathbb{E}_\varepsilon\Big[ \sup_{f_\theta \in \mathcal{P}_2,\; \|f_\theta\|_2 \le \delta} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_\theta(x_i) \Big| \Big] \overset{(i)}{\le} 2 \sqrt{\frac{3 \log(n+1)}{n}} \le 4 \sqrt{\frac{\log(n+1)}{n}} \tag{14.10}$$

for any set of samples $\{x_i\}_{i=1}^n$. As we will see, this upper bound is actually rather loose for 
small values of $\delta$, since inequality (i) makes no use of the localization condition $\|f_\theta\|_2 \le \delta$. 

Based on the naive upper bound (14.10), we can conclude that there is a constant $c_0$ such 
that inequality (14.4) is satisfied with $\delta_n = c_0 \big(\frac{\log(n+1)}{n}\big)^{1/4}$. Thus, for any $t \ge c_0 \big(\frac{\log(n+1)}{n}\big)^{1/4}$, 
Theorem 14.1 guarantees that 

$$\big| \|f\|_n^2 - \|f\|_2^2 \big| \le \frac{1}{2} \|f\|_2^2 + t^2 \quad \text{for all } f \in \mathcal{P}_2 \tag{14.11}$$

with probability at least $1 - c_1 e^{-c_2 n t^2}$. This bound establishes that $\|f\|_n^2$ and $\|f\|_2^2$ are of the 
same order for all functions with norm $\|f\|_2 \ge c_0 \big(\frac{\log(n+1)}{n}\big)^{1/4}$, but this order of fluctuation is 
sub-optimal. As we explore in Exercise 14.3, an entropy integral approach can be used to 
remove the superfluous logarithm from this result, but the slow $n^{-1/4}$ rate remains. 

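Let us now see how localization can be exploited to yield the optimal scaling $n^{-1/2}$. In 
order to do so, it is convenient to re-parameterize our quadratic functions in terms of an 
orthonormal basis of $L^2[-1, 1]$. In particular, the first three functions in the Legendre basis 
take the form 

$$\phi_0(x) = \frac{1}{\sqrt{2}}, \quad \phi_1(x) = \sqrt{\frac{3}{2}}\, x \quad \text{and} \quad \phi_2(x) = \sqrt{\frac{5}{8}}\, (3x^2 - 1).$$

By construction, these functions are orthonormal in $L^2[-1, 1]$, meaning that the inner product 
$\langle \phi_j, \phi_k \rangle_{L^2[-1,1]} := \int_{-1}^1 \phi_j(x) \phi_k(x) \, dx$ is equal to one if $j = k$, and zero otherwise. Using 
these basis functions, any polynomial function in $\mathcal{P}_2$ then has an expansion of the form 
$f_\gamma(x) = \gamma_0 \phi_0(x) + \gamma_1 \phi_1(x) + \gamma_2 \phi_2(x)$, where $\|f_\gamma\|_2 = \|\gamma\|_2$ by construction. Given a set of $n$ 
samples, let us define an $n \times 3$ matrix $M$ with entries $M_{ij} = \phi_j(x_i)$. In terms of this matrix, 
we then have 

$$\mathbb{E}\Big[ \sup_{\|f_\gamma\|_2 \le \delta} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_\gamma(x_i) \Big| \Big] \le \mathbb{E}\Big[ \sup_{\|\gamma\|_2 \le \delta} \Big| \frac{1}{n} \varepsilon^T M \gamma \Big| \Big] \overset{(i)}{\le} \frac{\delta}{n}\, \mathbb{E}\big[\|\varepsilon^T M\|_2\big] \overset{(ii)}{\le} \frac{\delta}{n} \sqrt{\mathbb{E}\big[\|\varepsilon^T M\|_2^2\big]},$$

where step (i) follows from the Cauchy-Schwarz inequality, and step (ii) follows from 
Jensen's inequality and concavity of the square-root function. Now since the Rademacher 
variables are independent, we have 

$$\mathbb{E}_\varepsilon\big[\|\varepsilon^T M\|_2^2\big] = \operatorname{trace}(M M^T) = \operatorname{trace}(M^T M).$$

By the orthonormality of the basis $\{\phi_0, \phi_1, \phi_2\}$, we have $\mathbb{E}_x[\operatorname{trace}(M^T M)] = 3n$. Putting 
together the pieces yields the upper bound 

$$\mathbb{E}\Big[ \sup_{\|f_\gamma\|_2 \le \delta} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_\gamma(x_i) \Big| \Big] \le \frac{\sqrt{3}\, \delta}{\sqrt{n}}.$$

Based on this bound, we see that there is a universal constant $c$ such that inequality (14.4) 
is satisfied with $\delta_n = \frac{c}{\sqrt{n}}$. Applying Theorem 14.1 then guarantees that for any $t \ge \frac{c}{\sqrt{n}}$, we 
have 

$$\big| \|f\|_n^2 - \|f\|_2^2 \big| \le \frac{\|f\|_2^2}{2} + \frac{1}{2} t^2 \quad \text{for all } f \in \mathcal{P}_2, \tag{14.12}$$

a bound that holds with probability at least $1 - c_1 e^{-c_2 n t^2}$. Unlike the earlier bound (14.11), 
this result has exploited the localization and thereby increased the rate from the slow one of 
$\big(\frac{\log n}{n}\big)^{1/4}$ to the optimal one of $\big(\frac{1}{n}\big)^{1/2}$. ♣ 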
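
The two ingredients of the localized calculation in Example 14.2 can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: it verifies orthonormality of the three Legendre functions by a midpoint rule, and it estimates the localized complexity by Monte Carlo, using the fact that the supremum of $|\varepsilon^T M \gamma|/n$ over the ball $\|\gamma\|_2 \le \delta$ equals $(\delta/n)\|M^T \varepsilon\|_2$. The sample size, radius, seed and function names are assumptions made for illustration.

```python
import math
import random


def phi(j, x):
    """First three orthonormal Legendre functions on [-1, 1]."""
    if j == 0:
        return 1.0 / math.sqrt(2.0)
    if j == 1:
        return math.sqrt(1.5) * x
    return math.sqrt(5.0 / 8.0) * (3.0 * x * x - 1.0)


def inner_product(j, k, grid=20000):
    """Midpoint-rule approximation of the integral of phi_j * phi_k over [-1, 1]."""
    h = 2.0 / grid
    points = (-1.0 + (i + 0.5) * h for i in range(grid))
    return sum(phi(j, x) * phi(k, x) for x in points) * h


def localized_complexity(n, delta, trials, rng):
    """Monte Carlo average of (delta / n) * ||M^T eps||_2 over samples and signs."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        col = [sum(e * phi(j, x) for e, x in zip(eps, xs)) for j in range(3)]
        total += delta / n * math.sqrt(sum(c * c for c in col))
    return total / trials


gram = [[inner_product(j, k) for k in range(3)] for j in range(3)]
rng = random.Random(1)
n, delta = 400, 0.5
estimate = localized_complexity(n, delta, trials=500, rng=rng)
bound = math.sqrt(3.0) * delta / math.sqrt(n)
print(f"estimate = {estimate:.4f}, sqrt(3) * delta / sqrt(n) = {bound:.4f}")
```

The Gram matrix should be close to the identity, and the Monte Carlo estimate should fall below $\sqrt{3}\,\delta/\sqrt{n}$, with both quantities scaling linearly in $\delta$.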


Whereas the previous example concerned a parametric class of functions, Theorem 14.1 
also applies to nonparametric function classes. Since metric entropy has been computed for 
many such classes, it provides one direct route for obtaining upper bounds on the solutions 
of inequalities (14.4) or (14.7). One such avenue is summarized in the following: 


Corollary 14.3 Let $N_n(t; \mathbb{B}_n(\delta; \mathcal{F}))$ denote the $t$-covering number of the set 
$\mathbb{B}_n(\delta; \mathcal{F}) = \{f \in \mathcal{F} \mid \|f\|_n \le \delta\}$ in the empirical $L^2(\mathbb{P}_n)$-norm. Then the empirical version of critical 
inequality (14.7) is satisfied for any $\delta > 0$ such that 

$$\frac{64}{\sqrt{n}} \int_{\frac{\delta^2}{2b}}^{\delta} \sqrt{\log N_n\big(t; \mathbb{B}_n(\delta; \mathcal{F})\big)} \, dt \le \frac{\delta^2}{b}. \tag{14.13}$$

The proof of this result is essentially identical to the proof of Corollary 13.7, so that we leave 
the details to the reader. 

In order to make use of Corollary 14.3, we need to control the covering number $N_n$ in 
the empirical $L^2(\mathbb{P}_n)$-norm. One approach is based on observing that the covering number 
$N_n$ can always be bounded by the covering number $N_{\sup}$ in the supremum norm $\|\cdot\|_\infty$. Let us 
illustrate this approach with an example. 


Example 14.4 (Bounds for convex Lipschitz functions) Recall from Example 13.11 the 
class of convex 1-Lipschitz functions 

$$\mathcal{F}_{\mathrm{conv}}([0,1]; 1) := \{f\colon [0,1] \to \mathbb{R} \mid f(0) = 0, \text{ and } f \text{ is convex and 1-Lipschitz}\}.$$

From known results, the metric entropy of this function class in the sup-norm is upper 
bounded as $\log N_{\sup}(t; \mathcal{F}_{\mathrm{conv}}) \lesssim t^{-1/2}$ for all $t > 0$ sufficiently small (see the bibliographic 
section for details). Thus, in order to apply Corollary 14.3, it suffices to find $\delta > 0$ such that 

$$\frac{1}{\sqrt{n}} \int_0^\delta (1/t)^{1/4} \, dt = \frac{1}{\sqrt{n}} \, \frac{4}{3} \, \delta^{3/4} \lesssim \delta^2.$$

Setting $\delta = c\, n^{-2/5}$ for a sufficiently large constant $c > 0$ is suitable, and applying 
Theorem 14.1 with this choice yields 

$$\big| \|f\|_2 - \|f\|_n \big| \le c' n^{-2/5} \quad \text{for all } f \in \mathcal{F}_{\mathrm{conv}}([0,1]; 1)$$

with probability greater than $1 - c_1 e^{-c_2 n^{1/5}}$. ♣ 


In the exercises at the end of this chapter, we explore various other results that can be derived 
using Corollary 14.3. 
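
The rate in Example 14.4 comes from balancing an entropy integral of order $\delta^{3/4}/\sqrt{n}$ against $\delta^2$. As a rough numerical sketch, assuming all constants hidden in the $\lesssim$ relations are set to one (an assumption made purely for illustration), the code below checks the closed form of the integral and confirms that the balancing radius scales as $n^{-2/5}$.

```python
import math


def entropy_integral(delta, steps=100000):
    """Midpoint-rule value of the integral of t^(-1/4) over (0, delta]."""
    h = delta / steps
    return sum(((i + 0.5) * h) ** -0.25 for i in range(steps)) * h


def critical_delta(n):
    """Radius solving (1/sqrt(n)) * (4/3) * delta^(3/4) = delta^2 (constants set to 1)."""
    return (4.0 / (3.0 * math.sqrt(n))) ** 0.8


for n in (10**2, 10**4, 10**6):
    d = critical_delta(n)
    print(f"n = {n:>7}: delta = {d:.5f}, delta * n^(2/5) = {d * n ** 0.4:.4f}")
```

The rescaled radius $\delta \cdot n^{2/5}$ stays constant across sample sizes, which is the $n^{-2/5}$ scaling used in the example.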


14.1.2 Specialization to kernel classes 


As discussed in Chapter 12, reproducing kernel Hilbert spaces (RKHSs) have a number 
of attractive computational properties in application to nonparametric estimation. In this 
section, we discuss the specialization of Theorem 14.1 to the case of a function class F that 
corresponds to the unit ball of an RKHS. 

Recall that any RKHS is specified by a symmetric, positive semidefinite kernel function 
$\mathcal{K}\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Under mild conditions, Mercer's theorem (as stated previously in 
Theorem 12.20) ensures that $\mathcal{K}$ has a countable collection of non-negative eigenvalues $(\mu_j)_{j=1}^\infty$. 
The following corollary shows that the population form of the localized Rademacher 
complexity for an RKHS is determined by the decay rate of these eigenvalues, and similarly, the 
empirical version is determined by the eigenvalues of the empirical kernel matrix. 


Corollary 14.5 Let $\mathcal{F} = \{f \in \mathbb{H} \mid \|f\|_{\mathbb{H}} \le 1\}$ be the unit ball of an RKHS with 
eigenvalues $(\mu_j)_{j=1}^\infty$. Then the localized population Rademacher complexity (14.3) is 
upper bounded as 

$$\overline{\mathcal{R}}_n(\delta; \mathcal{F}) \le \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^\infty \min\{\mu_j, \delta^2\}}. \tag{14.14a}$$

Similarly, letting $(\hat{\mu}_j)_{j=1}^n$ denote the eigenvalues of the renormalized kernel matrix 
$K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = \mathcal{K}(x_i, x_j)/n$, the localized empirical Rademacher complexity 
(14.6) is upper bounded as 

$$\widehat{\mathcal{R}}_n(\delta; \mathcal{F}) \le \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^n \min\{\hat{\mu}_j, \delta^2\}}. \tag{14.14b}$$

Given knowledge of the eigenvalues of the kernel (operator or matrix), these upper bounds 
on the localized Rademacher complexities allow us to specify values $\delta_n$ that satisfy the 
inequalities (14.4) and (14.7), in the population and empirical cases, respectively. Lemma 13.22 
from Chapter 13 provides an upper bound on the empirical Gaussian complexity for a kernel 
class, which yields the claim (14.14b). The proof of inequality (14.14a) is based on 
techniques similar to the proof of Lemma 13.22; we work through the details in Exercise 14.4. 


Let us illustrate the use of Corollary 14.5 with some examples. 


Example 14.6 (Bounds for first-order Sobolev space) Consider the first-order Sobolev 
space 

$$\mathbb{H}^1[0, 1] := \{f\colon [0,1] \to \mathbb{R} \mid f(0) = 0, \text{ and } f \text{ is abs. cts. with } f' \in L^2[0, 1]\}.$$

Recall from Example 12.16 that it is a reproducing kernel Hilbert space with kernel function 
$\mathcal{K}(x, z) = \min\{x, z\}$. From the result of Exercise 12.14, the unit ball $\{f \in \mathbb{H}^1[0, 1] \mid \|f\|_{\mathbb{H}} \le 1\}$ 
is uniformly bounded with $b = 1$, so that Corollary 14.5 may be applied. Moreover, from 
Example 12.23, the eigenvalues of this kernel function are given by $\mu_j = \big(\frac{2}{(2j-1)\pi}\big)^2$ for 
$j = 1, 2, \ldots$. Using calculations analogous to those from Example 13.20, it can be shown that 

$$\frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^\infty \min\{\delta^2, \mu_j\}} \le c' \sqrt{\frac{\delta}{n}}$$

for some universal constant $c' > 0$. Consequently, Corollary 14.5 implies that the critical 
inequality (14.4) is satisfied for $\delta_n = c\, n^{-1/3}$. Applying Theorem 14.1, we conclude that 

$$\sup_{\|f\|_{\mathbb{H}^1[0,1]} \le 1} \big| \|f\|_2 - \|f\|_n \big| \le c_0\, n^{-1/3}$$

with probability greater than $1 - c_1 e^{-c_2 n^{1/3}}$. ♣ 
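
The $n^{-1/3}$ rate of Example 14.6 can be reproduced numerically from the eigenvalue bound (14.14a). The sketch below is an illustration under assumptions: the infinite sum is truncated at a fixed number of terms, $b = 1$, and the critical radius is located by bisection on the inequality "bound $\le \delta^2/b$"; the function names and truncation level are hypothetical choices.

```python
import math

# Eigenvalues mu_j = (2 / ((2j - 1) * pi))^2 of the kernel min(x, z);
# the infinite sum in (14.14a) is truncated at TERMS (an approximation).
TERMS = 5000
MU = [(2.0 / ((2 * j - 1) * math.pi)) ** 2 for j in range(1, TERMS + 1)]


def kernel_complexity_bound(delta, n):
    """Truncated right-hand side of (14.14a) for the first-order Sobolev kernel."""
    total = sum(min(mu, delta ** 2) for mu in MU)
    return math.sqrt(2.0 / n) * math.sqrt(total)


def critical_radius(n, b=1.0, iters=50):
    """Bisection for the smallest delta in (0, 1] with bound(delta) <= delta^2 / b."""
    lo, hi = 1e-4, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kernel_complexity_bound(mid, n) <= mid ** 2 / b:
            hi = mid
        else:
            lo = mid
    return hi


for n in (10**3, 10**4, 10**5):
    d = critical_radius(n)
    print(f"n = {n:>6}: delta_n = {d:.5f}, delta_n * n^(1/3) = {d * n ** (1 / 3):.3f}")
```

Bisection is justified here because the ratio of the bound to $\delta^2$ is decreasing in $\delta$, so the inequality holds for every radius above the crossing point. The rescaled radius $\delta_n \cdot n^{1/3}$ should be roughly constant across sample sizes.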
Example 14.7 (Bounds for Gaussian kernels) Consider the RKHS generated by the Gaussian 
kernel $\mathcal{K}(x, z) = e^{-\frac{1}{2}(x - z)^2}$ defined on the unit square $[-1, 1] \times [-1, 1]$. As discussed in 
Example 13.21, there are universal constants $(c_0, c_1)$ such that the eigenvalues of the associated 
kernel operator satisfy a bound of the form 

$$\mu_j \le c_0\, e^{-c_1 j \log j} \quad \text{for } j = 1, 2, \ldots.$$

Following the same line of calculation as in Example 13.21, it is straightforward to show 
that inequality (14.14a) is satisfied by $\delta_n = c_0 \sqrt{\frac{\log(n+1)}{n}}$ for a sufficiently large but universal 
constant $c_0$. Consequently, Theorem 14.1 implies that, for the unit ball of the Gaussian kernel 
RKHS, we have 

$$\sup_{\|f\|_{\mathbb{H}} \le 1} \big| \|f\|_2 - \|f\|_n \big| \le c_0 \sqrt{\frac{\log(n+1)}{n}}$$

with probability greater than $1 - 2 e^{-c_1 \log(n+1)}$. By comparison to the parametric function 
class discussed in Example 14.2, we see that the unit ball of a Gaussian kernel RKHS obeys 
a uniform law with a similar rate. This fact illustrates that the unit ball of the Gaussian kernel 
RKHS, even though nonparametric in nature, is still relatively small. ♣ 


14.1.3 Proof of Theorem 14.1 


Let us now return to prove Theorem 14.1. By a rescaling argument, it suffices to consider the 
case $b = 1$. Moreover, it is convenient to redefine $\delta_n$ as a positive solution to the inequality 

$$\overline{\mathcal{R}}_n(\delta; \mathcal{F}) \le \frac{\delta^2}{16}. \tag{14.15}$$

This new $\delta_n$ is simply a rescaled version of the original one, and we shall use it to prove a 
version of the theorem with $c_0 = 1$. 
With these simplifications, our proof is based on the family of random variables 

$$Z_n(r) := \sup_{f \in \mathbb{B}_2(r; \mathcal{F})} \big| \|f\|_2^2 - \|f\|_n^2 \big|, \quad \text{where } \mathbb{B}_2(r; \mathcal{F}) = \{f \in \mathcal{F} \mid \|f\|_2 \le r\}, \tag{14.16}$$

indexed by $r \in (0, 1]$. We let $\mathcal{E}_0$ and $\mathcal{E}_1$, respectively, denote the events that inequality 
(14.5a) or inequality (14.5b) are violated. We also define the auxiliary events 
$\mathcal{A}_0(r) := \{Z_n(r) \ge r^2/2\}$, and 

$$\mathcal{A}_1 := \big\{Z_n(\|f\|_2) \ge \delta_n \|f\|_2 \text{ for some } f \in \mathcal{F} \text{ with } \|f\|_2 \ge \delta_n\big\}.$$


The following lemma shows that it suffices to control these two auxiliary events: 


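Lemma 14.8 For any star-shaped function class, we have 

$$\mathcal{E}_0 \overset{(i)}{\subseteq} \mathcal{A}_0(t) \quad \text{and} \quad \mathcal{E}_1 \overset{(ii)}{\subseteq} \mathcal{A}_0(\delta_n) \cup \mathcal{A}_1. \tag{14.17}$$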


Proof Beginning with the inclusion (i), we divide the analysis into two cases. First, 
suppose that there exists some function with norm $\|f\|_2 \le t$ that violates inequality (14.5a). 
For this function, we must have $\big| \|f\|_n^2 - \|f\|_2^2 \big| > \frac{t^2}{2}$, showing that $Z_n(t) > \frac{t^2}{2}$, so that $\mathcal{A}_0(t)$ 
must hold. Otherwise, suppose that the inequality (14.5a) is violated by some function with 
$\|f\|_2 > t$. Any such function satisfies the inequality $\big| \|f\|_2^2 - \|f\|_n^2 \big| > \|f\|_2^2/2$. We may then 
define the rescaled function $\tilde{f} = \frac{t}{\|f\|_2} f$; by construction, it has $\|\tilde{f}\|_2 = t$, and also belongs to $\mathcal{F}$ 
due to the star-shaped condition. Hence, reasoning as before, we find that $\mathcal{A}_0(t)$ must also 
hold in this case. 

Turning to the inclusion (ii), it is equivalent to show that $\mathcal{A}_0^c(\delta_n) \cap \mathcal{A}_1^c \subseteq \mathcal{E}_1^c$. We split the 
analysis into two cases: 

Case 1: Consider a function $f \in \mathcal{F}$ with $\|f\|_2 \le \delta_n$. Then on the complement of $\mathcal{A}_0(\delta_n)$, 
either we have $\|f\|_n \le \delta_n$, in which case $\big| \|f\|_n - \|f\|_2 \big| \le \delta_n$, or we have $\|f\|_n \ge \delta_n$, in which 
case 

$$\big| \|f\|_n - \|f\|_2 \big| = \frac{\big| \|f\|_n^2 - \|f\|_2^2 \big|}{\|f\|_n + \|f\|_2} \le \frac{\delta_n^2}{\delta_n} = \delta_n.$$

Case 2: Next consider a function $f \in \mathcal{F}$ with $\|f\|_2 > \delta_n$. In this case, on the complement 
of $\mathcal{A}_1$, we have 

$$\big| \|f\|_n - \|f\|_2 \big| = \frac{\big| \|f\|_n^2 - \|f\|_2^2 \big|}{\|f\|_n + \|f\|_2} \le \frac{\|f\|_2\, \delta_n}{\|f\|_2} = \delta_n,$$

which completes the proof. 


In order to control the events $\mathcal{A}_0(r)$ and $\mathcal{A}_1$, we need to control the tail behavior of the 
random variable $Z_n(r)$. 


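Lemma 14.9 For all $r, s \ge \delta_n$, we have 

$$\mathbb{P}\Big[ Z_n(r) \ge \frac{r \delta_n}{4} + \frac{s^2}{4} \Big] \le 2 e^{-c_2 n \min\{\frac{s^4}{r^2},\, s^2\}}. \tag{14.18}$$

Setting both $r$ and $s$ equal to $t \ge \delta_n$ in Lemma 14.9 yields the bound $\mathbb{P}[\mathcal{A}_0(t)] \le 2 e^{-c_2 n t^2}$. 
Using inclusion (i) in Lemma 14.8, this completes the proof of inequality (14.5a). 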


Let us now prove Lemma 14.9. 


Proof Beginning with the expectation, we have 

$$\mathbb{E}[Z_n(r)] \overset{(i)}{\le} 2\, \mathbb{E}\Big[ \sup_{f \in \mathbb{B}_2(r; \mathcal{F})} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f^2(x_i) \Big| \Big] \overset{(ii)}{\le} 4\, \mathbb{E}\Big[ \sup_{f \in \mathbb{B}_2(r; \mathcal{F})} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \Big| \Big] = 4\, \overline{\mathcal{R}}_n(r),$$

where step (i) uses a standard symmetrization argument (in particular, see the proof of 
Theorem 4.10 in Chapter 4); and step (ii) follows from the boundedness assumption ($\|f\|_\infty \le 1$ 
uniformly for all $f \in \mathcal{F}$) and the Ledoux-Talagrand contraction inequality (5.61) from 
Chapter 5. Given our star-shaped condition on the function class, Lemma 13.6 guarantees 
that the function $r \mapsto \overline{\mathcal{R}}_n(r)/r$ is non-increasing on the interval $(0, \infty)$. Consequently, for any 
$r \ge \delta_n$, we have 

$$\frac{\overline{\mathcal{R}}_n(r)}{r} \overset{(iii)}{\le} \frac{\overline{\mathcal{R}}_n(\delta_n)}{\delta_n} \overset{(iv)}{\le} \frac{\delta_n}{16},$$

where step (iii) follows from the non-increasing property, and step (iv) follows from our 
definition of $\delta_n$. Putting together the pieces, we find that the expectation is upper bounded 
as $\mathbb{E}[Z_n(r)] \le \frac{r \delta_n}{4}$. 

Next we establish a tail bound above the expectation using Talagrand's inequality from 
Theorem 3.27. Let $f$ be an arbitrary member of $\mathbb{B}_2(r; \mathcal{F})$. Since $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$, the 
recentered functions $g = f^2 - \mathbb{E}[f^2(X)]$ are bounded as $\|g\|_\infty \le 1$, and moreover 

$$\operatorname{var}(g) \le \mathbb{E}[f^4] \le \mathbb{E}[f^2] \le r^2,$$

using the fact that $f \in \mathbb{B}_2(r; \mathcal{F})$. Consequently, by applying Talagrand's concentration 
inequality (3.83), we find that there is a universal constant $c$ such that 

$$\mathbb{P}\Big[ Z_n(r) \ge \mathbb{E}[Z_n(r)] + \frac{s^2}{4} \Big] \le 2 \exp\Big( -\frac{n s^4}{c\, (r^2 + r \delta_n + s^2)} \Big) \le 2 e^{-c_2 n \min\{\frac{s^4}{r^2},\, s^2\}}, \tag{14.19}$$

where the final step uses the fact that $r \ge \delta_n$. 


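It remains to use Lemmas 14.8 and 14.9 to establish inequality (14.5b). By combining 
inclusion (ii) in Lemma 14.8 with the union bound, it suffices to bound the sum 
$\mathbb{P}[\mathcal{A}_0(\delta_n)] + \mathbb{P}[\mathcal{A}_1]$. Setting $r = s = \delta_n$ in the bound (14.18) yields the bound $\mathbb{P}[\mathcal{A}_0(\delta_n)] \le 2 e^{-c_2 n \delta_n^2}$, 
whereas setting $s^2 = r \delta_n$ yields the bound 

$$\mathbb{P}\Big[ Z_n(r) \ge \frac{r \delta_n}{2} \Big] \le 2 e^{-c_2 n \delta_n^2}. \tag{14.20}$$

Given this bound, one is tempted to “complete” the proof by setting $r = \|f\|_2$, and 
applying the tail bound (14.20) to the variable $Z_n(\|f\|_2)$. The delicacy here is that the tail 
bound (14.20) applies only to a deterministic radius $r$, as opposed to the random¹ radius $\|f\|_2$. 
This difficulty can be addressed by using a so-called “peeling” argument. For $m = 1, 2, \ldots$, 
define the events 

$$\mathcal{S}_m := \{f \in \mathcal{F} \mid 2^{m-1} \delta_n \le \|f\|_2 \le 2^m \delta_n\}.$$

Since $\|f\|_2 \le \|f\|_\infty \le 1$ by assumption, any function in $\mathcal{F} \cap \{\|f\|_2 \ge \delta_n\}$ belongs to some $\mathcal{S}_m$ 
for $m \in \{1, 2, \ldots, M\}$, where $M \le 4 \log(1/\delta_n)$. 

By the union bound, we have $\mathbb{P}(\mathcal{A}_1) \le \sum_{m=1}^M \mathbb{P}(\mathcal{A}_1 \cap \mathcal{S}_m)$. Now if the event $\mathcal{A}_1 \cap \mathcal{S}_m$ 
occurs, then there is a function $f$ with $\|f\|_2 \le r_m := 2^m \delta_n$ such that 

$$\big| \|f\|_n^2 - \|f\|_2^2 \big| \ge \|f\|_2\, \delta_n \ge \tfrac{1}{2} r_m \delta_n.$$

Consequently, we have $\mathbb{P}[\mathcal{A}_1 \cap \mathcal{S}_m] \le \mathbb{P}[Z_n(r_m) \ge \tfrac{1}{2} r_m \delta_n] \le 2 e^{-c_2 n \delta_n^2}$, and putting together the 
pieces yields 

$$\mathbb{P}[\mathcal{A}_1] \le \sum_{m=1}^M 2 e^{-c_2 n \delta_n^2} \le 2 e^{-c_2 n \delta_n^2 + \log M} \le 2 e^{-\frac{c_2 n \delta_n^2}{2}},$$

where the final step follows from the assumed inequality $\frac{c_2}{2}\, n \delta_n^2 \ge \log(4 \log(1/\delta_n))$. 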
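
The bookkeeping behind the peeling argument can be made concrete with a short sketch. It assigns a norm value in $[\delta_n, 1]$ to the dyadic shell that contains it, and counts how many shells are needed to reach norm one. The helper names are hypothetical, and the comparison with $4\log(1/\delta_n)$ is checked only for $\delta_n \le 1/2$, which is an assumption made here for illustration.

```python
import math


def shell_index(norm, delta_n):
    """Smallest m >= 1 with norm <= 2^m * delta_n (requires norm >= delta_n)."""
    if norm < delta_n:
        raise ValueError("peeling only covers functions with norm >= delta_n")
    m = 1
    while norm > 2 ** m * delta_n:
        m += 1
    return m


def num_shells(delta_n):
    """Number of dyadic shells needed to cover all norms in [delta_n, 1]."""
    return shell_index(1.0, delta_n)


for delta_n in (0.5, 0.1, 0.01, 0.001):
    M = num_shells(delta_n)
    print(f"delta_n = {delta_n}: M = {M}, 4 log(1/delta_n) = {4 * math.log(1 / delta_n):.2f}")
```

Since the number of shells grows only logarithmically in $1/\delta_n$, the union bound over shells costs a factor $\log M$ in the exponent, which is what the condition on $n\delta_n^2$ absorbs.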


14.2 A one-sided uniform law 


A potentially limiting aspect of Theorem 14.1 is that it requires the underlying function class 
to be b-uniformly bounded. To a certain extent, this condition can be relaxed by instead 
imposing tail conditions of the sub-Gaussian or sub-exponential type. See the bibliographic 
discussion for references to results of this type. 

However, in many applications (including the problem of nonparametric least squares 
from Chapter 13) it is the lower bound on $\|f\|_n^2$ that is of primary interest. As discussed 
in Chapter 2, for ordinary scalar random variables, such one-sided tail bounds can often 
be obtained under much milder conditions than their corresponding two-sided analogs. 
Concretely, in the current context, for any fixed function $f \in \mathcal{F}$, applying the lower tail 
bound (2.23) to the i.i.d. sequence $\{f(x_i)\}_{i=1}^n$ yields the guarantee 

$$\mathbb{P}\big[ \|f\|_n^2 \le \|f\|_2^2 - t \big] \le e^{-\frac{n t^2}{2\, \mathbb{E}[f^4(x)]}}. \tag{14.21}$$

Consequently, whenever the fourth moment can be controlled by some multiple of the 
second moment, then we can obtain non-trivial lower tail bounds. 

¹ It is random because the norm of the function $f$ that violates the bound is a random variable. 

Our goal in this section is to derive lower tail bounds of this type that hold uniformly over 
a given function class. Let us state more precisely the type of fourth-moment control that is 
required. In particular, suppose that there exists a constant $C$ such that 

$$\mathbb{E}[f^4(x)] \le C^2\, \mathbb{E}[f^2(x)] \quad \text{for all } f \in \mathcal{F} \text{ with } \|f\|_2 \le 1. \tag{14.22a}$$

When does a bound of this type hold? It is certainly implied by the global condition 

$$\mathbb{E}[f^4(x)] \le C^2 \big(\mathbb{E}[f^2(x)]\big)^2 \quad \text{for all } f \in \mathcal{F}. \tag{14.22b}$$

However, as illustrated in Example 14.11 below, there are other function classes for which 
the milder condition (14.22a) can hold while the stronger condition (14.22b) fails. 


Let us illustrate these fourth-moment conditions with some examples. 


Example 14.10 (Linear functions and random matrices) For a given vector $\theta \in \mathbb{R}^d$, 
define the linear function $f_\theta(x) = \langle x, \theta \rangle$, and consider the class of all linear functions 
$\mathcal{F}_{\mathrm{lin}} = \{f_\theta \mid \theta \in \mathbb{R}^d\}$. As discussed in more detail in Example 14.13 to follow shortly, uniform laws 
for $\|f\|_n^2$ over such a function class are closely related to random matrix theory. Note that the 
linear function class $\mathcal{F}_{\mathrm{lin}}$ is never uniformly bounded in a meaningful way. Nonetheless, it 
is still possible for the strong moment condition (14.22b) to hold under certain conditions 
on the zero-mean random vector $x$. 

For instance, suppose that for each $\theta \in \mathbb{R}^d$, the random variable $f_\theta(x) = \langle x, \theta \rangle$ is Gaussian. 
In this case, using the standard formula (2.54) for the moments of a Gaussian random vector, 
we have $\mathbb{E}[f_\theta^4(x)] = 3 \big(\mathbb{E}[f_\theta^2(x)]\big)^2$, showing that condition (14.22b) holds uniformly with 
$C^2 = 3$. Note that $C$ does not depend on the variance of $f_\theta(x)$, which can be arbitrarily 
large. Exercise 14.6 provides some examples of non-Gaussian variables for which the 
fourth-moment condition (14.22b) holds in application to linear functions. ♣ 
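
The Gaussian moment identity used in Example 14.10 is easy to confirm numerically. The following sketch integrates against the $N(0, v)$ density with a trapezoid rule, where $v = \theta^T \Sigma \theta$ is the variance of $\langle x, \theta \rangle$; the covariance matrix and coefficient vectors below are hypothetical values chosen only for illustration.

```python
import math


def gaussian_moment(power, variance, grid=20001, width=12.0):
    """Trapezoid-rule approximation of E[Z^power] for Z ~ N(0, variance)."""
    sd = math.sqrt(variance)
    h = 2.0 * width * sd / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = -width * sd + i * h
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * z ** power * math.exp(-z * z / (2.0 * variance))
    return total * h / math.sqrt(2.0 * math.pi * variance)


def quadratic_form(theta, cov):
    """Variance theta^T cov theta of the linear function <x, theta>."""
    d = len(theta)
    return sum(theta[i] * cov[i][j] * theta[j] for i in range(d) for j in range(d))


# Hypothetical covariance matrix and coefficient vectors, for illustration only.
cov = [[2.0, 0.5], [0.5, 1.0]]
for theta in ([1.0, 0.0], [0.3, -0.7], [10.0, 5.0]):
    var = quadratic_form(theta, cov)
    ratio = gaussian_moment(4, var) / gaussian_moment(2, var) ** 2
    print(f"theta = {theta}: variance = {var:.4f}, E[f^4] / (E[f^2])^2 = {ratio:.6f}")
```

The ratio is the same for every $\theta$, however large the variance, which is the sense in which the constant $C$ in (14.22b) does not depend on the scale of $f_\theta(x)$.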


Example 14.11 (Additive nonparametric models) Given a univariate function class Y, 
consider the class of functions on R? given by 


d 
Faa ={f: RE R | f = Dzi for some g; € FY}. (14.23) 


j=l 


The problem of estimating a function of this type is known as additive regression, and it 
provides one avenue for escaping the curse of dimension; see the bibliographic section for 
further discussion. 

Suppose that the univariate function class $\mathcal{G}$ is uniformly bounded, say $\|g_j\|_\infty \le b$ for all 
$g_j \in \mathcal{G}$, and consider a distribution over $x \in \mathbb{R}^d$ under which each $g_j(x_j)$ is a zero-mean 
random variable. (This latter assumption can always be ensured by a recentering step.) Assume 
moreover that the design vector $x \in \mathbb{R}^d$ has four-way independent components; that is, for 
any distinct quadruple $(j, k, \ell, m)$, the random variables $(x_j, x_k, x_\ell, x_m)$ are jointly 
independent. For a given $\delta \in (0, 1]$, consider a function $f = \sum_{j=1}^d g_j \in \mathcal{F}_{\text{add}}$ such that 
$\mathbb{E}[f^2(x)] = \delta^2$, or equivalently, using our independence conditions, such that 

$$\mathbb{E}[f^2(x)] = \sum_{j=1}^d \|g_j\|_2^2 = \delta^2.$$

For any such function, the fourth moment can be bounded as 

$$\mathbb{E}[f^4(x)] = \mathbb{E}\Big[\Big(\sum_{j=1}^d g_j(x_j)\Big)^4\Big] = \sum_{j=1}^d \mathbb{E}[g_j^4(x_j)] + 6 \sum_{j \neq k} \mathbb{E}[g_j^2(x_j)]\,\mathbb{E}[g_k^2(x_k)] \le \sum_{j=1}^d \mathbb{E}[g_j^4(x_j)] + 6\delta^4,$$

where we have used the zero-mean property, and the four-way independence of the coordinates. 
Since $\|g_j\|_\infty \le b$ for each $g_j \in \mathcal{G}$, we have $\mathbb{E}[g_j^4(x_j)] \le b^2\,\mathbb{E}[g_j^2(x_j)]$, and putting 
together the pieces yields 

$$\mathbb{E}[f^4(x)] \le b^2\delta^2 + 6\delta^4 \le (b^2 + 6)\,\delta^2,$$

where the final step uses the fact that $\delta \le 1$ by assumption. Consequently, for any $\delta \in (0, 1]$, 
the weaker condition (14.22a) holds with $C^2 = b^2 + 6$. ♣ 
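
To make the bound concrete, here is a small exact computation under assumptions that are not in the text: each component is taken to be $g_j(x_j) = a_j x_j$ with independent Rademacher coordinates $x_j$, so that $b = \max_j |a_j|$ and the coordinates are in particular four-way independent. The helper `additive_moments` and the coefficient vector `a` are hypothetical. The moments are computed by enumerating all $2^d$ sign patterns, so no sampling error is involved.

```python
import itertools
import numpy as np

def additive_moments(a):
    """Exact E[f^2] and E[f^4] for f(x) = sum_j a_j x_j with independent Rademacher x_j."""
    a = np.asarray(a, dtype=float)
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=len(a))))
    f = signs @ a  # value of f on each of the 2^d equally likely sign patterns
    return np.mean(f**2), np.mean(f**4)

a = np.array([0.5, 0.4, 0.3, 0.2])    # assumed coefficients, so b = max |a_j| = 0.5
b = np.max(np.abs(a))
second, fourth = additive_moments(a)  # second = delta^2 = sum_j a_j^2 = 0.54 <= 1
bound = (b**2 + 6.0) * second         # right-hand side (b^2 + 6) * delta^2
```

Here the fourth moment is well below the bound, as expected: the constant $b^2 + 6$ is convenient rather than tight.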


Having seen some examples of function classes that satisfy the moment conditions (14.22a) 
and/or (14.22b), let us now state a one-sided uniform law. Recalling that $\bar{\mathcal{R}}_n$ denotes the 
population Rademacher complexity, consider the usual type of inequality 

$$\frac{\bar{\mathcal{R}}_n(\delta; \mathcal{F})}{\delta} \le \frac{\delta}{128\,C}, \qquad (14.24)$$

where the constant C appears in the fourth-moment condition (14.22a). Our statement also 
involves the convenient shorthand $\mathbb{B}_2(\delta) := \{ f \in \mathcal{F} \mid \|f\|_2 \le \delta \}$. 


Theorem 14.12 Consider a star-shaped class $\mathcal{F}$ of functions, each zero-mean under 
$\mathbb{P}$, and such that the fourth-moment condition (14.22a) holds uniformly over $\mathcal{F}$, and 
suppose that the sample size n is large enough to ensure that there is a solution $\delta_n \le 1$ 
to the inequality (14.24). Then for any $\delta \in [\delta_n, 1]$, we have 

$$\|f\|_n^2 \ge \tfrac{1}{2}\|f\|_2^2 \quad \text{for all } f \in \mathcal{F} \setminus \mathbb{B}_2(\delta) \qquad (14.25)$$

with probability at least $1 - e^{-c_1 n\delta^2 / C^2}$. 


Remark: The set $\mathcal{F} \setminus \mathbb{B}_2(\delta)$ can be replaced with $\mathcal{F}$ whenever the set $\mathcal{F} \cap \mathbb{B}_2(\delta)$ is 
cone-like; that is, whenever any non-zero function $f \in \mathbb{B}_2(\delta) \cap \mathcal{F}$ can be rescaled by 
$\alpha := \delta/\|f\|_2 \ge 1$, thereby yielding a new function $g := \alpha f$ that remains within $\mathcal{F}$. 


In order to illustrate Theorem 14.12, let us revisit our earlier examples. 

Example 14.13 (Linear functions and random matrices, continued) Recall the linear function 
class $\mathcal{F}_{\text{lin}}$ introduced previously in Example 14.10. Uniform laws over this function 
class are closely related to earlier results on non-asymptotic random matrix theory from 
Chapter 6. In particular, supposing that the design vector x has a zero-mean distribution 
with covariance matrix $\Sigma$, the function $f_\theta(x) = \langle x, \theta \rangle$ has $L^2(\mathbb{P})$-norm 

$$\|f_\theta\|_2^2 = \theta^T\,\mathbb{E}[x x^T]\,\theta = \|\sqrt{\Sigma}\,\theta\|_2^2 \quad \text{for each } f_\theta \in \mathcal{F}_{\text{lin}}. \qquad (14.26)$$

On the other hand, given a set of n samples $\{x_i\}_{i=1}^n$, we have 

$$\|f_\theta\|_n^2 = \frac{1}{n} \sum_{i=1}^n \langle x_i, \theta \rangle^2 = \frac{1}{n} \|X\theta\|_2^2, \qquad (14.27)$$

where the design matrix $X \in \mathbb{R}^{n \times d}$ has the vector $x_i^T$ as its ith row. Consequently, in 
application to this function class, Theorem 14.12 provides a uniform lower bound on the quadratic 
forms $\frac{1}{n}\|X\theta\|_2^2$: in particular, as long as the sample size n is large enough to ensure that $\delta_n \le 1$, 
we have 

$$\frac{1}{n}\|X\theta\|_2^2 \ge \frac{1}{2}\|\sqrt{\Sigma}\,\theta\|_2^2 \quad \text{for all } \theta \in \mathbb{R}^d. \qquad (14.28)$$


As one concrete example, suppose that the covariate vector x follows a $N(0, \Sigma)$ distribution. 
For any $\theta \in \mathbb{S}^{d-1}$, the random variable $\langle x, \theta \rangle$ is sub-Gaussian with parameter at most 
$|||\sqrt{\Sigma}|||_2$, but this quantity could be very large, and potentially growing with the dimension 
d. However, as discussed in Example 14.10, the strong moment condition (14.22b) always 
holds with $C^2 = 3$, regardless of the size of $|||\sqrt{\Sigma}|||_2$. In order to apply Theorem 14.12, we 
need to determine a positive solution $\delta_n$ to the inequality (14.24). Writing each $x = \sqrt{\Sigma}\,w$, 
where $w \sim N(0, I_d)$, note that we have $\|f_\theta\|_2 = \|\sqrt{\Sigma}\,\theta\|_2$. Consequently, by definition of 
the local Rademacher complexity, we have 

$$\bar{\mathcal{R}}_n(\delta; \mathcal{F}_{\text{lin}}) = \mathbb{E}\Big[\sup_{\substack{\theta \in \mathbb{R}^d \\ \|\sqrt{\Sigma}\theta\|_2 \le \delta}} \Big|\Big\langle \frac{1}{n}\sum_{i=1}^n \varepsilon_i w_i,\; \sqrt{\Sigma}\,\theta \Big\rangle\Big|\Big] = \delta\,\mathbb{E}\Big\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w_i\Big\|_2.$$

Note that the random variables $\{\varepsilon_i w_i\}_{i=1}^n$ are i.i.d. and standard Gaussian (since the 
symmetrization by independent Rademacher variables has no effect). Consequently, previous 
results from Chapter 2 guarantee that $\mathbb{E}\big\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w_i\big\|_2 \le \sqrt{d/n}$. Putting together the pieces, we 
conclude that $\delta_n^2 \lesssim d/n$. Therefore, for this particular ensemble, Theorem 14.12 implies that, 
as long as $n \gtrsim d$, then 

$$\frac{\|X\theta\|_2^2}{n} \ge \frac{1}{2}\|\sqrt{\Sigma}\,\theta\|_2^2 \quad \text{for all } \theta \in \mathbb{R}^d \qquad (14.29)$$

with high probability. The key part of this lower bound is that the maximum eigenvalue 
$|||\sqrt{\Sigma}|||_2$ never enters the result. 
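
The passage from the complexity bound to the critical radius can be written out explicitly. The following worked step is a sketch that simply tracks the constant $128C$ from (14.24) with $C^2 = 3$; the resulting numerical threshold on n is illustrative and makes no claim of sharpness.

```latex
% Assumes only the bound \bar{R}_n(\delta; F_lin) <= \delta \sqrt{d/n} derived in the example.
\[
\frac{\bar{\mathcal{R}}_n(\delta; \mathcal{F}_{\text{lin}})}{\delta}
  \;\le\; \sqrt{\frac{d}{n}}
  \;\le\; \frac{\delta}{128\,C}
\qquad \text{whenever} \qquad
\delta \;\ge\; 128\,C\,\sqrt{\frac{d}{n}} .
\]
% Hence \delta_n = 128 C \sqrt{d/n} solves (14.24), and with C^2 = 3:
\[
\delta_n^2 \;=\; 128^2\, C^2\, \frac{d}{n} \;=\; 49152\,\frac{d}{n},
\qquad \text{so that} \qquad
\delta_n \le 1 \;\Longleftrightarrow\; n \;\ge\; 49152\, d .
\]
```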

As another concrete example, the four-way independent and B-bounded random variables 
described in Exercise 14.6 also satisfy the moment condition (14.22b) with $C^2 = B + 6$. A 
similar calculation then shows that, with high probability, this ensemble also satisfies a lower 
bound of the form (14.29) where $\Sigma = I_d$. Note that these random variables need not be 
sub-Gaussian; in fact, the condition does not even require the existence of moments larger than 
four. ♣ 

In Exercise 14.7, we illustrate the use of Theorem 14.12 for controlling the restricted 
eigenvalues (RE) of some random matrix ensembles. 
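
The Gaussian case of Example 14.13 lends itself to a quick numerical check. The sketch below is illustrative only: the function name `min_restricted_ratio`, the diagonal choice of $\Sigma$ and the sizes n = 2000, d = 20 are assumptions made for the demonstration. It computes the smallest value of $\frac{1}{n}\|X\theta\|_2^2 / \|\sqrt{\Sigma}\,\theta\|_2^2$ over all non-zero $\theta$, which is the minimum eigenvalue of the whitened sample covariance.

```python
import numpy as np

def min_restricted_ratio(n, d, sigma_diag, seed=0):
    """Return min over theta != 0 of (||X theta||_2^2 / n) / ||sqrt(Sigma) theta||_2^2.

    Rows of X are i.i.d. N(0, Sigma) with Sigma = diag(sigma_diag).
    """
    rng = np.random.default_rng(seed)
    root = np.sqrt(np.asarray(sigma_diag, dtype=float))
    w = rng.standard_normal((n, d))   # rows w_i ~ N(0, I_d)
    x = w * root                      # rows x_i = sqrt(Sigma) w_i
    sample_cov = x.T @ x / n
    # Whitening by Sigma^{-1/2} turns the ratio into an ordinary Rayleigh quotient.
    whitened = sample_cov / np.outer(root, root)
    return np.linalg.eigvalsh(whitened)[0]

sigma_diag = np.logspace(-2, 2, 20)   # assumed spectrum with condition number 1e4
ratio = min_restricted_ratio(n=2000, d=20, sigma_diag=sigma_diag)
```

In runs of this kind the ratio stays well above 1/2 even though the largest eigenvalue of $\Sigma$ is 100, in line with the remark that $|||\sqrt{\Sigma}|||_2$ never enters the bound (14.29).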


Let us now return to a nonparametric example: 


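Example 14.14 (Additive nonparametric models, continued) In this example, we return to 
the class $\mathcal{F}_{\text{add}}$ of additive nonparametric models previously introduced in Example 14.11. 
We let $\varepsilon_n$ be the critical radius for the univariate function class $\mathcal{G}$ in the definition (14.23); 
thus, the scalar $\varepsilon_n$ satisfies an inequality of the form $\bar{\mathcal{R}}_n(\varepsilon; \mathcal{G}) \lesssim \varepsilon^2$. In Exercise 14.8, we 
prove that the critical radius $\delta_n$ for the d-dimensional additive family $\mathcal{F}_{\text{add}}$ satisfies the upper 
bound $\delta_n \lesssim \sqrt{d}\,\varepsilon_n$. Consequently, Theorem 14.12 guarantees that 

$$\|f\|_n^2 \ge \tfrac{1}{2}\|f\|_2^2 \quad \text{for all } f \in \mathcal{F}_{\text{add}} \text{ with } \|f\|_2 \ge c_0 \sqrt{d}\,\varepsilon_n \qquad (14.30)$$

with probability at least $1 - e^{-c_1 n d\,\varepsilon_n^2}$. 
As a concrete example, suppose that the univariate function class $\mathcal{G}$ is given by a 
first-order Sobolev space; for such a family, the univariate rate scales as $\varepsilon_n^2 \asymp n^{-2/3}$ (see 
Example 13.20 for details). For this particular class of additive models, with probability at least 
$1 - e^{-c_1 d\, n^{1/3}}$, we are guaranteed that 

$$\underbrace{\frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^d g_j(x_{ij})\Big)^2}_{\|f\|_n^2} \;\ge\; \frac{1}{2}\underbrace{\sum_{j=1}^d \|g_j\|_2^2}_{\|f\|_2^2} \qquad (14.31)$$

uniformly over all functions of the form $f = \sum_{j=1}^d g_j$ with $\|f\|_2 \gtrsim \sqrt{d}\, n^{-1/3}$. ♣ 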


14.2.1 Consequences for nonparametric least squares 


Theorem 14.12, in conjunction with our earlier results from Chapter 13, has some immediate 
corollaries for nonparametric least squares. Recall the standard model for nonparametric 
regression, in which we observe noisy samples of the form $y_i = f^*(x_i) + \sigma w_i$, where $f^* \in \mathcal{F}$ 
is the unknown regression function. Our corollary involves the local complexity of the 
shifted function class $\mathcal{F}^* = \mathcal{F} - f^*$. 

We let $\delta_n$ and $\varepsilon_n$ (respectively) be any positive solutions to the inequalities 

$$\frac{\bar{\mathcal{R}}_n(\delta; \mathcal{F})}{\delta} \overset{\text{(i)}}{\le} \frac{\delta}{128\,C} \quad \text{and} \quad \frac{\mathcal{G}_n(\varepsilon; \mathcal{F}^*)}{\varepsilon} \overset{\text{(ii)}}{\le} \frac{\varepsilon}{2\sigma}, \qquad (14.32)$$

where the localized Gaussian complexity $\mathcal{G}_n(\varepsilon; \mathcal{F}^*)$ was defined in equation (13.16), prior 
to the statement of Theorem 13.5. To be clear, the quantity $\varepsilon_n$ is a random variable, since it 
depends on the covariates $\{x_i\}_{i=1}^n$, which are modeled as random in this chapter. 


Corollary 14.15 Under the conditions of Theorems 13.5 and 14.12, there are universal 
positive constants $(c_0, c_1, c_2)$ such that the nonparametric least-squares estimate $\hat{f}$ 
satisfies 

$$\mathbb{P}_{w,x}\Big[\|\hat{f} - f^*\|_2^2 \ge c_0\big(\varepsilon_n^2 + \delta_n^2\big)\Big] \le c_1 e^{-c_2 \frac{n\delta_n^2}{\sigma^2 + C^2}}. \qquad (14.33)$$


Proof We split the argument into two cases: 


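Case 1: Suppose that $\delta_n \ge \varepsilon_n$. We are then guaranteed that $\delta_n$ is a solution to inequality (ii) 
in equation (14.32). Consequently, we may apply Theorem 13.5 with $t = \delta_n$ to find that 

$$\mathbb{P}_w\Big[\|\hat{f} - f^*\|_n^2 \ge 16\,\delta_n^2\Big] \le e^{-\frac{n\delta_n^2}{2\sigma^2}}.$$

On the other hand, Theorem 14.12 implies that 

$$\mathbb{P}_{x,w}\Big[\|\hat{f} - f^*\|_2^2 \ge 2\delta_n^2 + 2\|\hat{f} - f^*\|_n^2\Big] \le e^{-c_2 \frac{n\delta_n^2}{C^2}}.$$

Putting together the pieces yields that 

$$\mathbb{P}_{x,w}\Big[\|\hat{f} - f^*\|_2^2 \ge c_0\,\delta_n^2\Big] \le c_1 e^{-c_2 \frac{n\delta_n^2}{\sigma^2 + C^2}},$$

which implies the claim. 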


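Case 2: Otherwise, we may assume that the event $\mathcal{A} := \{\delta_n < \varepsilon_n\}$ holds. Note that this event 
depends on the random covariates $\{x_i\}_{i=1}^n$ via the random quantity $\varepsilon_n$. It suffices to bound the 
probability of the event $\mathcal{E} \cap \mathcal{A}$, where 

$$\mathcal{E} := \big\{\|\hat{f} - f^*\|_2^2 \ge 16\,\varepsilon_n^2 + 2\,\delta_n^2\big\}.$$

In order to do so, we introduce a third event, namely $\mathcal{B} := \{\|\hat{f} - f^*\|_n^2 \le 8\,\varepsilon_n^2\}$, and make note 
of the upper bound 

$$\mathbb{P}[\mathcal{E} \cap \mathcal{A}] \le \mathbb{P}[\mathcal{E} \cap \mathcal{B}] + \mathbb{P}[\mathcal{A} \cap \mathcal{B}^c].$$

On one hand, we have 

$$\mathbb{P}[\mathcal{E} \cap \mathcal{B}] \le \mathbb{P}\Big[\|\hat{f} - f^*\|_2^2 \ge 2\|\hat{f} - f^*\|_n^2 + 2\delta_n^2\Big] \le e^{-c_2 \frac{n\delta_n^2}{C^2}},$$

where the final inequality follows from Theorem 14.12. 
On the other hand, let $\mathbb{I}[\mathcal{A}]$ be a zero-one indicator for the event $\mathcal{A} := \{\delta_n < \varepsilon_n\}$. Then 
applying Theorem 13.5 with $t = \varepsilon_n$ yields 

$$\mathbb{P}[\mathcal{A} \cap \mathcal{B}^c] \le \mathbb{E}_x\Big[e^{-\frac{n\varepsilon_n^2}{2\sigma^2}}\,\mathbb{I}[\mathcal{A}]\Big] \le e^{-\frac{n\delta_n^2}{2\sigma^2}}.$$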


Putting together the pieces yields the claim. 

14.2.2 Proof of Theorem 14.12 


Let us now turn to the proof of Theorem 14.12. We first claim that it suffices to consider 
functions belonging to the boundary of the $\delta$-ball, namely the set $\partial\mathbb{B}_2(\delta) = \{f \in \mathcal{F} \mid 
\|f\|_2 = \delta\}$. Indeed, suppose that the inequality (14.25) is violated for some $g \in \mathcal{F}$ with 
$\|g\|_2 > \delta$. By the star-shaped condition, the function $f := \frac{\delta}{\|g\|_2}\, g$ belongs to $\mathcal{F}$ and has norm 
$\|f\|_2 = \delta$. Finally, by rescaling, the inequality $\|g\|_n^2 < \frac{1}{2}\|g\|_2^2$ is equivalent to $\|f\|_n^2 < \frac{1}{2}\|f\|_2^2$. 
For any function $f \in \partial\mathbb{B}_2(\delta)$, it is equivalent to show that 

$$\|f\|_n^2 \ge \frac{3}{4}\|f\|_2^2 - \frac{\delta^2}{4}. \qquad (14.34)$$

In order to prove this bound, we make use of a truncation argument. For a level $\tau > 0$ to be 
chosen, consider the truncated quadratic 

$$\varphi_\tau(u) := \begin{cases} u^2 & \text{if } |u| \le \tau, \\ \tau^2 & \text{otherwise,} \end{cases} \qquad (14.35)$$

and define $f_\tau(x) = \operatorname{sign}(f(x))\sqrt{\varphi_\tau(f(x))}$. By construction, for any $f \in \partial\mathbb{B}_2(\delta)$, we have 
$\|f\|_n^2 \ge \|f_\tau\|_n^2$, and hence 

$$\|f\|_n^2 \ge \|f_\tau\|_2^2 - \sup_{f \in \partial\mathbb{B}_2(\delta)} \Big|\|f_\tau\|_n^2 - \|f_\tau\|_2^2\Big|. \qquad (14.36)$$


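The remainder of the proof consists of showing that a suitable choice of truncation level 
$\tau$ ensures that 

$$\|f_\tau\|_2^2 \ge \frac{3}{4}\|f\|_2^2 \quad \text{for all } f \in \partial\mathbb{B}_2(\delta) \qquad (14.37a)$$

and 

$$\mathbb{P}\Big[Z_n \ge \frac{\delta^2}{4}\Big] \le c_1 e^{-c_2 n\delta^2/C^2} \quad \text{where } Z_n := \sup_{f \in \partial\mathbb{B}_2(\delta)} \Big|\|f_\tau\|_n^2 - \|f_\tau\|_2^2\Big|. \qquad (14.37b)$$

These two bounds in conjunction imply that the lower bound (14.34) holds with probability 
at least $1 - c_1 e^{-c_2 n\delta^2/C^2}$, uniformly over all f with $\|f\|_2 = \delta$. 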


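Proof of claim (14.37a): Letting $\mathbb{I}[|f(x)| \ge \tau]$ be a zero-one indicator for the event 
$|f(x)| \ge \tau$, we have 

$$\|f\|_2^2 - \|f_\tau\|_2^2 \le \mathbb{E}\big[f^2(x)\,\mathbb{I}[|f(x)| \ge \tau]\big] \le \sqrt{\mathbb{E}[f^4(x)]}\,\sqrt{\mathbb{P}[|f(x)| \ge \tau]},$$

where the last step uses the Cauchy-Schwarz inequality. Combining the moment bound 
(14.22a) with Markov's inequality yields 

$$\sqrt{\mathbb{E}[f^4(x)]}\,\sqrt{\mathbb{P}[|f(x)| \ge \tau]} \le C\|f\|_2\,\sqrt{\frac{\mathbb{E}[f^4(x)]}{\tau^4}} \le C^2\,\frac{\|f\|_2^2}{\tau^2},$$

where the final inequality uses the moment bound (14.22a) again. Setting $\tau^2 = 4C^2$ yields 
the bound $\|f\|_2^2 - \|f_\tau\|_2^2 \le \frac{1}{4}\|f\|_2^2$, which is equivalent to the claim (14.37a). 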
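
The truncation step can be illustrated on a toy distribution. In the sketch below, the four-point distribution for $f(x)$ and the helper `truncate` are assumptions chosen so that the truncation is actually active; the constant is taken as the smallest $C^2$ for which $\mathbb{E}[f^4] \le C^2\,\mathbb{E}[f^2]$ holds for this single function, whereas the theorem requires (14.22a) uniformly over the class.

```python
import numpy as np

def truncate(values, tau):
    """f_tau = sign(f) * sqrt(phi_tau(f)), which equals f clipped to [-tau, tau]."""
    return np.sign(values) * np.minimum(np.abs(values), tau)

# Hypothetical distribution of f(x): mostly small values, plus a rare large value.
values = np.array([-0.5, 0.5, 5.0, -5.0])
probs = np.array([0.4995, 0.4995, 0.0005, 0.0005])

second = np.sum(probs * values**2)   # ||f||_2^2, here below 1
fourth = np.sum(probs * values**4)   # E[f^4]
c_sq = fourth / second               # smallest C^2 with E[f^4] <= C^2 E[f^2]
tau = 2.0 * np.sqrt(c_sq)            # truncation level with tau^2 = 4 C^2
truncated_second = np.sum(probs * truncate(values, tau) ** 2)
```

With these numbers the level $\tau$ falls below 5, so the rare large values are clipped, yet the truncated second moment retains more than three quarters of $\|f\|_2^2$, as claim (14.37a) asserts.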
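Proof of claim (14.37b): Beginning with the expectation, a standard symmetrization 
argument (see Proposition 4.11) guarantees that 

$$\mathbb{E}_x[Z_n] \le 2\,\mathbb{E}_{x,\varepsilon}\Big[\sup_{f \in \mathbb{B}_2(\delta; \mathcal{F})} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f_\tau^2(x_i)\Big|\Big].$$

Our truncation procedure ensures that $f_\tau^2(x) = \varphi_\tau(f(x))$, where $\varphi_\tau$ is a Lipschitz function 
with constant $L = 2\tau$. Consequently, the Ledoux-Talagrand contraction inequality (5.61) 
guarantees that 

$$\mathbb{E}_x[Z_n] \le 8\tau\,\mathbb{E}_{x,\varepsilon}\Big[\sup_{f \in \mathbb{B}_2(\delta; \mathcal{F})} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big] \le 8\tau\,\bar{\mathcal{R}}_n(\delta; \mathcal{F}) \le 8\tau\,\frac{\delta^2}{128\,C},$$

where the final step uses the assumed inequality $\bar{\mathcal{R}}_n(\delta; \mathcal{F}) \le \frac{\delta^2}{128\,C}$. Our previous choice 
$\tau = 2C$ ensures that $\mathbb{E}_x[Z_n] \le \frac{\delta^2}{8}$. 

Next we prove an upper tail bound on the random variable $Z_n$, in particular using 
Talagrand's theorem for empirical processes (Theorem 3.27). By construction, we have 
$\|f_\tau^2\|_\infty \le \tau^2 = 4C^2$, and 

$$\operatorname{var}(f_\tau^2(x)) \le \mathbb{E}[f_\tau^4(x)] \le \tau^2\|f\|_2^2 = 4C^2\delta^2.$$

Consequently, Talagrand's inequality (3.83) implies that 

$$\mathbb{P}\big[Z_n \ge \mathbb{E}[Z_n] + u\big] \le c_1 \exp\Big(-\frac{c_2\, n u^2}{C^2\delta^2 + C^2 u}\Big). \qquad (14.38)$$

Since $\mathbb{E}[Z_n] \le \frac{\delta^2}{8}$, the claim (14.37b) follows by setting $u = \frac{\delta^2}{8}$. 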


14.3 A uniform law for Lipschitz cost functions 


Up to this point, we have considered uniform laws for the difference between the empirical 
squared norm $\|f\|_n^2$ and its expectation $\|f\|_2^2$. As formalized in Corollary 14.15, such results 
are useful, for example, in deriving bounds on the L?(P)-error of the nonparametric least- 
squares estimator. In this section, we turn to a more general class of prediction problems, 
and a type of uniform law that is useful for many of them. 


14.3.1 General prediction problems 


A general prediction problem can be specified in terms of a space $\mathcal{X}$ of covariates or 
predictors, and a space $\mathcal{Y}$ of response variables. A predictor is a function f that maps a covariate 
$x \in \mathcal{X}$ to a prediction $\hat{y} = f(x) \in \hat{\mathcal{Y}}$. Here the space $\hat{\mathcal{Y}}$ may be either the same as the 
response space $\mathcal{Y}$, or a superset thereof. The goodness of a predictor f is measured in terms 
of a cost function $\mathcal{L}\colon \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$, whose value $\mathcal{L}(\hat{y}, y)$ corresponds to the cost of predicting 
$\hat{y} \in \hat{\mathcal{Y}}$ when the underlying true response is some $y \in \mathcal{Y}$. Given a collection of n samples 
$\{(x_i, y_i)\}_{i=1}^n$, a natural way in which to determine a predictor is by minimizing the empirical 
cost 

$$P_n(\mathcal{L}(f(x), y)) := \frac{1}{n}\sum_{i=1}^n \mathcal{L}(f(x_i), y_i). \qquad (14.39)$$

Although the estimator $\hat{f}$ is obtained by minimizing the empirical cost (14.39), our ultimate 
goal is to assess its quality when measured in terms of the population cost function 

$$P(\mathcal{L}(f(x), y)) := \mathbb{E}_{x,y}\big[\mathcal{L}(f(x), y)\big], \qquad (14.40)$$

and our goal is thus to understand when a minimizer of the empirical cost (14.39) is a 
near-minimizer of the population cost. 

As discussed previously in Chapter 4, this question can be addressed by deriving a suitable 
type of uniform law of large numbers. More precisely, for each $f \in \mathcal{F}$, let us define the 
function $\mathcal{L}_f\colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ via $\mathcal{L}_f(x, y) = \mathcal{L}(f(x), y)$, and let us write 

$$P_n(\mathcal{L}_f) = P_n(\mathcal{L}(f(x), y)) \quad \text{and} \quad \bar{\mathcal{L}}_f := P(\mathcal{L}_f) = P(\mathcal{L}(f(x), y)).$$

In terms of this convenient shorthand, our question can be understood as deriving a 
Glivenko-Cantelli law for the so-called cost class $\{\mathcal{L}_f \mid f \in \mathcal{F}\}$. 

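Throughout this section, we study prediction problems for which $\hat{\mathcal{Y}}$ is some subset of the 
real line $\mathbb{R}$. For a given constant $L > 0$, we say that the cost function $\mathcal{L}$ is L-Lipschitz in its 
first argument if 

$$|\mathcal{L}(z, y) - \mathcal{L}(\tilde{z}, y)| \le L\,|z - \tilde{z}| \qquad (14.41)$$

for all pairs $z, \tilde{z} \in \hat{\mathcal{Y}}$ and $y \in \mathcal{Y}$. We say that the population cost function $f \mapsto P(\mathcal{L}_f)$ is 
$\gamma$-strongly convex with respect to the $L^2(\mathbb{P})$-norm at $f^*$ if there is some $\gamma > 0$ such that 

$$P\Big[\underbrace{\mathcal{L}_f}_{\mathcal{L}(f(x), y)} - \underbrace{\mathcal{L}_{f^*}}_{\mathcal{L}(f^*(x), y)} - \underbrace{\frac{\partial \mathcal{L}}{\partial z}\Big|_{f^*}}_{\frac{\partial \mathcal{L}}{\partial z}(f^*(x), y)}\, \underbrace{(f - f^*)}_{f(x) - f^*(x)}\Big] \ge \frac{\gamma}{2}\|f - f^*\|_2^2 \qquad (14.42)$$

for all $f \in \mathcal{F}$. Note that it is sufficient (but not necessary) for the function $z \mapsto \mathcal{L}(z, y)$ to 
be $\gamma$-strongly convex in a pointwise sense for each $y \in \mathcal{Y}$. Let us illustrate these conditions 
with some examples. 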


Example 14.16 (Least-squares regression) In a standard regression problem, the response 
space $\mathcal{Y}$ is the real line or some subset thereof, and our goal is to estimate a regression 
function $x \mapsto f(x) \in \mathbb{R}$. In Chapter 13, we studied methods for nonparametric regression 
based on the least-squares cost $\mathcal{L}(z, y) = \frac{1}{2}(y - z)^2$. This cost function is not globally Lipschitz 
in general; however, it does become Lipschitz in certain special cases. For instance, consider 
the standard observation model $y = f^*(x) + \varepsilon$ in the special case of bounded noise, say 
$|\varepsilon| \le c$ for some constant c. If we perform nonparametric regression over a b-uniformly 
bounded function class $\mathcal{F}$, then for all $f, g \in \mathcal{F}$, we have 

$$|\mathcal{L}(f(x), y) - \mathcal{L}(g(x), y)| = \tfrac{1}{2}\big|(y - f(x))^2 - (y - g(x))^2\big| \le \tfrac{1}{2}\big|f^2(x) - g^2(x)\big| + |y|\,|f(x) - g(x)| \le \big(b + (b + c)\big)\,|f(x) - g(x)|,$$

so that the least-squares cost satisfies the Lipschitz condition (14.41) with $L = 2b + c$. Of course, 
this example is rather artificial since it excludes any type of unbounded noise variable $\varepsilon$, 
including the canonical case of Gaussian noise. 
In terms of strong convexity, note that, for any $y \in \mathbb{R}$, the function $z \mapsto \frac{1}{2}(y - z)^2$ is strongly 
convex with parameter $\gamma = 1$, so that $f \mapsto \bar{\mathcal{L}}_f$ satisfies the strong convexity condition (14.42) 
with $\gamma = 1$. ♣ 


Example 14.17 (Robust forms of regression) A concern with the use of the squared cost 
function in regression is its potential lack of robustness: if even a very small subset of obser- 
vations are corrupted, then they can have an extremely large effect on the resulting solution. 
With this concern in mind, it is interesting to consider a more general family of cost func- 
tions, say of the form 


$$\mathcal{L}(z, y) = \Psi(y - z), \qquad (14.43)$$

where $\Psi\colon \mathbb{R} \to [0, \infty]$ is a function that is symmetric around zero with $\Psi(0) = 0$, and 
almost everywhere differentiable with $\|\Psi'\|_\infty \le L$. Note that the least-squares cost fails to 
satisfy the required derivative bound, so it does not fall within this class. 

Examples of cost functions in the family (14.43) include the $\ell_1$-norm $\Psi_{\ell_1}(u) = |u|$, as well 
as Huber's robust function 

$$\Psi_{\text{Huber}}(u) = \begin{cases} \dfrac{u^2}{2} & \text{if } |u| \le \tau, \\[1ex] \tau|u| - \dfrac{\tau^2}{2} & \text{otherwise,} \end{cases} \qquad (14.44)$$

where $\tau > 0$ is a parameter to be specified. The Huber cost function offers some sort of 
compromise between the least-squares cost and the $\ell_1$-norm cost function. 

By construction, the function $\Psi_{\ell_1}$ is almost everywhere differentiable with $\|\Psi'_{\ell_1}\|_\infty \le 1$, 
whereas the Huber cost function is everywhere differentiable with $\|\Psi'_{\text{Huber}}\|_\infty \le \tau$. 
Consequently, the $\ell_1$-norm and Huber cost functions satisfy the Lipschitz condition (14.41) with 
parameters $L = 1$ and $L = \tau$, respectively. Moreover, since the Huber cost function is locally 
equivalent to the least-squares cost, the induced cost function (14.43) is locally strongly 
convex under fairly mild tail conditions on the random variable $y - f(x)$. ♣ 
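
A short sketch of the Huber cost may help fix ideas. The function names `huber` and `huber_cost`, the value $\tau = 1.5$ and the evaluation grid are illustrative assumptions; the finite-difference slopes give a numerical estimate of the Lipschitz constant, which should come out equal to $\tau$.

```python
import numpy as np

def huber(u, tau):
    """Huber's function (14.44): quadratic for |u| <= tau, linear beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= tau, 0.5 * u**2, tau * np.abs(u) - 0.5 * tau**2)

def huber_cost(z, y, tau):
    """Cost L(z, y) = Psi_Huber(y - z) from the family (14.43)."""
    return huber(np.asarray(y) - np.asarray(z), tau)

tau = 1.5
grid = np.linspace(-10.0, 10.0, 4001)
slopes = np.abs(np.diff(huber(grid, tau)) / np.diff(grid))
max_slope = slopes.max()   # finite-difference estimate of the Lipschitz constant
```

Near the origin the function coincides with $u^2/2$, which is the local equivalence to the least-squares cost mentioned above.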


Example 14.18 (Logistic regression) The goal of binary classification is to predict a label 
$y \in \{-1, +1\}$ on the basis of a covariate vector $x \in \mathcal{X}$. Suppose that we model the conditional 
distribution of the label $y \in \{-1, +1\}$ as 

$$\mathbb{P}_f(y \mid x) = \frac{1}{1 + e^{-2yf(x)}}, \qquad (14.45)$$

where $f\colon \mathcal{X} \to \mathbb{R}$ is the discriminant function to be estimated. The method of maximum 
likelihood then corresponds to minimizing the cost function 

$$\mathcal{L}_f(x, y) = \mathcal{L}(f(x), y) = \log\big(1 + e^{-2yf(x)}\big). \qquad (14.46)$$

It is easy to see that the function $\mathcal{L}$ is 1-Lipschitz in its first argument. Moreover, at the 
population level, we have 

$$P(\mathcal{L}_f - \mathcal{L}_{f^*}) = \mathbb{E}_{x,y}\Big[\log \frac{1 + e^{-2f(x)y}}{1 + e^{-2f^*(x)y}}\Big] = \mathbb{E}_x\Big[D\big(\mathbb{P}_{f^*}(\cdot \mid x)\,\|\,\mathbb{P}_f(\cdot \mid x)\big)\Big],$$

corresponding to the expected value of the Kullback-Leibler divergence between the two 
conditional distributions indexed by $f^*$ and f. Under relatively mild conditions on the 
behavior of the random variable $f(x)$ as f ranges over $\mathcal{F}$, this cost function will be $\gamma$-strongly 
convex. ♣ 
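
The identity between the excess population cost and the Kullback-Leibler divergence can be verified pointwise in x. The sketch below fixes a single covariate and two hypothetical values $f(x) = 0.3$ and $f^*(x) = -0.8$; all function names are illustrative. Since the label takes only two values, the expectation over y is an exact two-term sum.

```python
import numpy as np

def label_prob(y, f_val):
    """P(y | x) = 1 / (1 + exp(-2 y f(x))) as in (14.45), for y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-2.0 * y * f_val))

def logistic_cost(f_val, y):
    """Cost (14.46): log(1 + exp(-2 y f(x))), the negative log-likelihood."""
    return np.log1p(np.exp(-2.0 * y * f_val))

def excess_cost(f_val, f_star_val):
    """E_y[L(f, y) - L(f*, y)] with y drawn from the model indexed by f*."""
    return sum(label_prob(y, f_star_val)
               * (logistic_cost(f_val, y) - logistic_cost(f_star_val, y))
               for y in (-1.0, 1.0))

def kl_divergence(f_star_val, f_val):
    """D(P_{f*}(. | x) || P_f(. | x)) for the two-point label distribution."""
    return sum(label_prob(y, f_star_val)
               * np.log(label_prob(y, f_star_val) / label_prob(y, f_val))
               for y in (-1.0, 1.0))

gap = excess_cost(0.3, -0.8)     # excess cost of f relative to f* at this x
kl = kl_divergence(-0.8, 0.3)    # should agree with gap up to rounding
```

The two quantities agree because the cost (14.46) is exactly the negative log-likelihood of the model (14.45).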
Example 14.19 (Support vector machines and hinge cost) Support vector machines are 
another method for binary classification, again based on estimating discriminant functions 
$f\colon \mathcal{X} \to \mathbb{R}$. In their most popular instantiation, the discriminant functions are assumed to 
belong to some reproducing kernel Hilbert space $\mathbb{H}$, equipped with the norm $\|\cdot\|_{\mathbb{H}}$. The 
support vector machine is based on the hinge cost function 

$$\mathcal{L}(f(x), y) = \max\{0,\; 1 - yf(x)\}, \qquad (14.47)$$

which is 1-Lipschitz by inspection. Again, the strong convexity properties of the population 
cost $f \mapsto P(\mathcal{L}_f)$ depend on the distribution of the covariates x, and the function class $\mathcal{F}$ 
over which we optimize. 

Given a set of n samples $\{(x_i, y_i)\}_{i=1}^n$, a common choice is to minimize the empirical risk 

$$P_n(\mathcal{L}(f(x), y)) = \frac{1}{n}\sum_{i=1}^n \max\{0,\; 1 - y_i f(x_i)\}$$

over a ball $\|f\|_{\mathbb{H}} \le R$ in some reproducing kernel Hilbert space. As explored in 
Exercise 12.20, this optimization problem can be reformulated as a quadratic program in n 
dimensions, and so can be solved easily. ♣ 
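
For concreteness, the empirical hinge risk is easy to evaluate once the values $f(x_i)$ are available. The sketch below uses made-up values `f_vals` and labels `labels`; it only evaluates the objective and does not solve the quadratic program mentioned above.

```python
import numpy as np

def hinge_cost(f_val, y):
    """Hinge cost (14.47): max{0, 1 - y f(x)}."""
    return np.maximum(0.0, 1.0 - y * f_val)

def empirical_hinge_risk(f_vals, labels):
    """Empirical risk (1/n) sum_i max{0, 1 - y_i f(x_i)}."""
    return float(np.mean(hinge_cost(np.asarray(f_vals), np.asarray(labels))))

f_vals = np.array([2.0, 0.5, -0.3, -1.5])    # hypothetical values f(x_i)
labels = np.array([1.0, 1.0, 1.0, -1.0])     # labels y_i
risk = empirical_hinge_risk(f_vals, labels)  # (0 + 0.5 + 1.3 + 0) / 4 = 0.45
```

Only the second and third samples contribute: the first and fourth are classified correctly with margin at least one, so their hinge cost is zero.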


14.3.2 Uniform law for Lipschitz cost functions 


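With these examples as underlying motivation, let us now turn to stating a general uniform 
law for Lipschitz cost functions. Let $f^* \in \mathcal{F}$ minimize the population cost function 
$f \mapsto P(\mathcal{L}_f)$, and consider the shifted function class 

$$\mathcal{F}^* := \{f - f^* \mid f \in \mathcal{F}\}. \qquad (14.48)$$

Our uniform law involves the population version of the localized Rademacher complexity 

$$\bar{\mathcal{R}}_n(\delta; \mathcal{F}^*) := \mathbb{E}_{x,\varepsilon}\Big[\sup_{\substack{g \in \mathcal{F}^* \\ \|g\|_2 \le \delta}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\, g(x_i)\Big|\Big]. \qquad (14.49)$$


Theorem 14.20 (Uniform law for Lipschitz cost functions) Given a uniformly 1-bounded 
function class $\mathcal{F}$ that is star-shaped around the population minimizer $f^*$, 
let $\delta_n^2 \ge \frac{c}{n}$ be any solution to the inequality 

$$\bar{\mathcal{R}}_n(\delta; \mathcal{F}^*) \le \delta^2. \qquad (14.50)$$

(a) Suppose that the cost function is L-Lipschitz in its first argument. Then we have 

$$\sup_{f \in \mathcal{F}} \frac{\big|P_n(\mathcal{L}_f - \mathcal{L}_{f^*}) - P(\mathcal{L}_f - \mathcal{L}_{f^*})\big|}{\|f - f^*\|_2 + \delta_n} \le 10\,L\,\delta_n \qquad (14.51)$$

with probability greater than $1 - c_1 e^{-c_2 n\delta_n^2}$. 
(b) Suppose that the cost function is L-Lipschitz and $\gamma$-strongly convex. Then for any 
function $\hat{f} \in \mathcal{F}$ such that $P_n(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) \le 0$, we have 

$$\|\hat{f} - f^*\|_2 \le \Big(\frac{20L}{\gamma} + 1\Big)\delta_n \qquad (14.52a)$$

and 

$$P(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) \le 10L\Big(\frac{20L}{\gamma} + 2\Big)\delta_n^2, \qquad (14.52b)$$

where both inequalities hold with the same probability as in part (a). 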


Under certain additional conditions on the function class, part (a) can be used to guarantee 
consistency of a procedure that chooses $\hat{f} \in \mathcal{F}$ to minimize the empirical cost 
$f \mapsto P_n(\mathcal{L}_f)$ over $\mathcal{F}$. In particular, since $f^* \in \mathcal{F}$ by definition, this procedure ensures that 
$P_n(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) \le 0$. Consequently, for any function class $\mathcal{F}$ with² $\|\cdot\|_2$-diameter at most 
D, the inequality (14.51) implies that 

$$P(\mathcal{L}_{\hat{f}}) \le P(\mathcal{L}_{f^*}) + 10\,L\,\delta_n\{2D + \delta_n\} \qquad (14.53)$$

with high probability. Thus, the bound (14.53) implies the consistency of the empirical cost 
minimization procedure in the following sense: up to a term of order $\delta_n$, the value $P(\mathcal{L}_{\hat{f}})$ is 
as small as the optimum $P(\mathcal{L}_{f^*}) = \min_{f \in \mathcal{F}} P(\mathcal{L}_f)$. 


Proof of Theorem 14.20 
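The proof is based on an analysis of the family of random variables 

$$Z_n(r) = \sup_{\|f - f^*\|_2 \le r} \big|P_n(\mathcal{L}_f - \mathcal{L}_{f^*}) - P(\mathcal{L}_f - \mathcal{L}_{f^*})\big|,$$

where $r > 0$ is a radius to be varied. The following lemma provides suitable control on the 
upper tails of these random variables: 


Lemma 14.21 For each $r \ge \delta_n$, the variable $Z_n(r)$ satisfies the tail bound 

$$\mathbb{P}\big[Z_n(r) \ge 8Lr\delta_n + u\big] \le c_1 \exp\Big(-\frac{c_2\, n u^2}{L^2 r^2 + L u}\Big). \qquad (14.54)$$


Deferring the proof of this intermediate claim for the moment, let us use it to complete the 
proof of Theorem 14.20; the proof itself is similar to that of Theorem 14.1. Define the events 
$\mathcal{E}_0 := \{Z_n(\delta_n) \ge 9L\delta_n^2\}$, and 

$$\mathcal{E}_1 := \Big\{\exists\, f \in \mathcal{F} \;\Big|\; \big|P_n(\mathcal{L}_f - \mathcal{L}_{f^*}) - P(\mathcal{L}_f - \mathcal{L}_{f^*})\big| \ge 10L\delta_n\|f - f^*\|_2 \text{ and } \|f - f^*\|_2 \ge \delta_n\Big\}.$$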


If there is some function $f \in \mathcal{F}$ that violates the bound (14.51), then at least one of 
the events $\mathcal{E}_0$ or $\mathcal{E}_1$ must occur. Applying Lemma 14.21 with $u = L\delta_n^2$ guarantees that 
$\mathbb{P}[\mathcal{E}_0] \le c_1 e^{-c_2 n\delta_n^2}$. Moreover, using the same peeling argument as in Theorem 14.1, we find 
that $\mathbb{P}[\mathcal{E}_1] \le c_1 e^{-c_2 n\delta_n^2}$, valid for all $\delta_n^2 \ge \frac{c}{n}$. Putting together the pieces completes the proof 
of the claim (14.51) in part (a). 

² A function class $\mathcal{F}$ has $\|\cdot\|_2$-diameter at most D if $\|f\|_2 \le D$ for all $f \in \mathcal{F}$. In this case, we have 
$\|\hat{f} - f^*\|_2 \le 2D$. 

Let us now prove the claims in part (b). By examining the proof of part (a), we see that it 
actually implies that either $\|\hat{f} - f^*\|_2 \le \delta_n$, or 

$$\big|P_n(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) - P(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*})\big| \le 10\,L\,\delta_n\,\|\hat{f} - f^*\|_2.$$

Since $P_n(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) \le 0$ by assumption, we see that any minimizer must satisfy either the 
bound $\|\hat{f} - f^*\|_2 \le \delta_n$, or the bound $P(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) \le 10L\delta_n\|\hat{f} - f^*\|_2$. On one hand, if the 
former inequality holds, then so does inequality (14.52a). On the other hand, if the latter 
inequality holds, then, combined with the strong convexity condition (14.42), we obtain 
$\|\hat{f} - f^*\|_2 \le \frac{20L\delta_n}{\gamma}$, which also implies inequality (14.52a). 

In order to establish the bound (14.52b), we make use of inequality (14.52a) within the 
original inequality (14.51); we then perform some algebra, recalling that $\hat{f}$ satisfies the 
inequality $P_n(\mathcal{L}_{\hat{f}} - \mathcal{L}_{f^*}) \le 0$. 
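
The algebra referred to above is short. The following worked step is a sketch of one way to carry it out, using only the statements (14.51) and (14.52a) together with the assumption $P_n(\mathcal{L}_{\hat f} - \mathcal{L}_{f^*}) \le 0$.

```latex
\begin{align*}
P(\mathcal{L}_{\hat f} - \mathcal{L}_{f^*})
  &\le P_n(\mathcal{L}_{\hat f} - \mathcal{L}_{f^*})
     + 10 L \delta_n \big( \|\hat f - f^*\|_2 + \delta_n \big)
     && \text{by (14.51)} \\
  &\le 10 L \delta_n \big( \|\hat f - f^*\|_2 + \delta_n \big)
     && \text{since } P_n(\mathcal{L}_{\hat f} - \mathcal{L}_{f^*}) \le 0 \\
  &\le 10 L \delta_n \Big( \Big(\frac{20L}{\gamma} + 1\Big)\delta_n + \delta_n \Big)
     && \text{by (14.52a)} \\
  &= 10 L \Big( \frac{20L}{\gamma} + 2 \Big) \delta_n^2 ,
\end{align*}
```

which is the bound (14.52b).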

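It remains to prove Lemma 14.21. By a rescaling argument, we may assume that $b = 1$. In 
order to bound the upper tail of $Z_n(r)$, we need to control the differences $\mathcal{L}_f - \mathcal{L}_{f^*}$ uniformly 
over all functions $f \in \mathcal{F}$ such that $\|f - f^*\|_2 \le r$. By the Lipschitz condition on the cost 
function and the boundedness of the functions f, we have $\|\mathcal{L}_f - \mathcal{L}_{f^*}\|_\infty \le L\|f - f^*\|_\infty \le 2L$. 
Moreover, we have 

$$\operatorname{var}(\mathcal{L}_f - \mathcal{L}_{f^*}) \le P\big[(\mathcal{L}_f - \mathcal{L}_{f^*})^2\big] \overset{\text{(i)}}{\le} L^2\|f - f^*\|_2^2 \overset{\text{(ii)}}{\le} L^2 r^2,$$

where inequality (i) follows from the Lipschitz condition on the cost function, and inequality 
(ii) follows since $\|f - f^*\|_2 \le r$. Consequently, by Talagrand's concentration theorem for 
empirical processes (Theorem 3.27), we have 

$$\mathbb{P}\big[Z_n(r) \ge 2\,\mathbb{E}[Z_n(r)] + u\big] \le c_1 \exp\Big\{-\frac{c_2\, n u^2}{L^2 r^2 + L u}\Big\}. \qquad (14.55)$$

It remains to upper bound the expectation: in particular, we have 

$$\begin{aligned}
\mathbb{E}[Z_n(r)] &\overset{\text{(i)}}{\le} 2\,\mathbb{E}\Big[\sup_{\|f - f^*\|_2 \le r} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\big\{\mathcal{L}(f(x_i), y_i) - \mathcal{L}(f^*(x_i), y_i)\big\}\Big|\Big] \\
&\overset{\text{(ii)}}{\le} 4L\,\mathbb{E}\Big[\sup_{\|f - f^*\|_2 \le r} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\big(f(x_i) - f^*(x_i)\big)\Big|\Big] \\
&= 4L\,\bar{\mathcal{R}}_n(r; \mathcal{F}^*) \\
&\overset{\text{(iii)}}{\le} 4Lr\delta_n, \quad \text{valid for all } r \ge \delta_n,
\end{aligned}$$

where step (i) follows from a symmetrization argument; step (ii) follows from the L-Lipschitz 
condition on the first argument of the cost function, and the Ledoux-Talagrand contraction 
inequality (5.61); and step (iii) uses the fact that the function $r \mapsto \frac{\bar{\mathcal{R}}_n(r; \mathcal{F}^*)}{r}$ is non-increasing, 
and our choice of $\delta_n$. Combined with the tail bound (14.55), the proof of Lemma 14.21 is 
complete. 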
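
To see how the lemma delivers the bound on $\mathbb{P}[\mathcal{E}_0]$ used in the proof of part (a), one can substitute $r = \delta_n$ and $u = L\delta_n^2$ into (14.54). The computation below is a sketch of that substitution; the constants $c_1, c_2$ are those of the lemma.

```latex
\begin{align*}
\mathbb{P}\big[ Z_n(\delta_n) \ge 9 L \delta_n^2 \big]
  &= \mathbb{P}\big[ Z_n(\delta_n) \ge 8 L \delta_n \cdot \delta_n + L \delta_n^2 \big] \\
  &\le c_1 \exp\Big( - \frac{c_2\, n\, L^2 \delta_n^4}{L^2 \delta_n^2 + L \cdot L \delta_n^2} \Big)
   = c_1 \exp\Big( - \frac{c_2}{2}\, n \delta_n^2 \Big).
\end{align*}
```

The Lipschitz constant cancels, so the exponent depends only on $n\delta_n^2$, matching the probability stated in Theorem 14.20(a) after relabeling $c_2$.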
14.4 Some consequences for nonparametric density estimation 


The results and techniques developed thus far have some useful applications to the problem 
of nonparametric density estimation. The problem is easy to state: given a collection of i.i.d. 
samples $\{x_i\}_{i=1}^n$, assumed to have been drawn from an unknown distribution with density $f^*$, 
how do we estimate the unknown density? The density estimation problem has been the 
subject of intensive study, and there are many methods for tackling it. In this section, we 
restrict our attention to two simple methods that are easily analyzed using the results from 
this and preceding chapters. 


14.4.1 Density estimation via the nonparametric maximum likelihood estimate 


Perhaps the most easily conceived method for density estimation is via a nonparametric 
analog of maximum likelihood. In particular, suppose that we fix some base class of densities 
$\mathcal{F}$, and then maximize the likelihood of the observed samples over this class. Doing so leads 
to a constrained form of the nonparametric maximum likelihood estimate (MLE), namely 

$$\hat{f} \in \arg\min_{f \in \mathcal{F}} P_n(-\log f(x)) = \arg\min_{f \in \mathcal{F}} \Big\{-\frac{1}{n}\sum_{i=1}^n \log f(x_i)\Big\}. \qquad (14.56)$$

To be clear, the class of densities $\mathcal{F}$ must be suitably restricted for this estimator to be well 
defined, which we assume to be the case for the present discussion. (See Exercise 14.9 for an 
example in which the nonparametric MLE $\hat{f}$ fails to exist.) As an alternative to constraining 
the estimate, it is also possible to define a regularized form of the nonparametric MLE. 

In order to illustrate the use of some bounds from this chapter, let us analyze the 
estimator (14.56) in the simple case when the true density $f^*$ is assumed to belong to $\mathcal{F}$. Given 
an understanding of this case, it is relatively straightforward to derive a more general result, 
in which the error is bounded by a combination of estimation error and approximation error 
terms, with the latter being non-zero when $f^* \notin \mathcal{F}$. 

For reasons to be clarified, it is convenient to measure the error in terms of the squared 
Hellinger distance. For densities f and g with respect to a base measure $\mu$, it is given by 

$$H^2(f \,\|\, g) := \frac{1}{2}\int_{\mathcal{X}} \big(\sqrt{f} - \sqrt{g}\big)^2\, d\mu. \qquad (14.57a)$$

As we explore in Exercise 14.10, a useful connection here is that the Kullback-Leibler (KL) 
divergence is lower bounded by (a multiple of) the squared Hellinger distance, viz. 

$$D(f \,\|\, g) \ge 2H^2(f \,\|\, g). \qquad (14.57b)$$
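
The inequality (14.57b) is easy to check numerically for discrete distributions, where the base measure is counting measure. The probability vectors `p` and `q` below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def squared_hellinger(p, q):
    """H^2(p || q) = (1/2) * sum (sqrt(p) - sqrt(q))^2 for probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def kl_divergence(p, q):
    """D(p || q) = sum p log(p / q), assuming all entries are strictly positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
h2 = squared_hellinger(p, q)
kl = kl_divergence(p, q)     # should be at least 2 * h2
```

For this pair the KL divergence is roughly twice as large as $2H^2$, so the lower bound holds with room to spare.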


Up to a constant pre-factor, the squared Hellinger distance is equivalent to the 
$L^2(\mu)$-norm difference of the square-root densities. For this reason, the square-root function class 
$\mathcal{G} = \{g = \sqrt{f} \text{ for some } f \in \mathcal{F}\}$ plays an important role in our analysis, as does the shifted 
square-root function class $\mathcal{G}^* := \mathcal{G} - \sqrt{f^*}$. 

In the relatively simple result to be given here, we assume that there are positive constants 
$(b, \nu)$ such that the square-root density class $\mathcal{G}$ is $\sqrt{b}$-uniformly bounded, and star-shaped 
around $\sqrt{f^*}$, and moreover that the unknown density $f^* \in \mathcal{F}$ is uniformly lower bounded 
as 

$$f^*(x) \ge \nu > 0 \quad \text{for all } x \in \mathcal{X}.$$

In terms of the population Rademacher complexity $\bar{\mathcal{R}}_n$, our result involves the critical 
inequality 

$$\bar{\mathcal{R}}_n(\delta; \mathcal{G}^*) \le \frac{\delta^2}{\sqrt{b + \nu}}. \qquad (14.58)$$

With this set-up, we have the following guarantee: 


Corollary 14.22 Given a class of densities satisfying the previous conditions, let $\delta_n$ 
be any solution to the critical inequality (14.58) such that $\delta_n^2 \ge \big(1 + \frac{b}{\nu}\big)\frac{1}{n}$. Then the 
nonparametric density estimate $\hat{f}$ satisfies the Hellinger bound 

$$H^2(\hat{f} \,\|\, f^*) \le c_0\,\delta_n^2 \qquad (14.59)$$

with probability greater than $1 - c_1 e^{-c_2 \frac{\nu}{b + \nu} n\delta_n^2}$. 

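Proof Our proof is based on applying Theorem 14.20(b) to the transformed function class 

$$\mathcal{H} = \Big\{\sqrt{\frac{f + f^*}{2f^*}} \;\Big|\; f \in \mathcal{F}\Big\},$$

equipped with the cost functions $\mathcal{L}_h(x) = -\log h(x)$. Since $\mathcal{F}$ is b-uniformly bounded and 
$f^*(x) \ge \nu$ for all $x \in \mathcal{X}$, for any $h \in \mathcal{H}$, we have 

$$\|h\|_\infty \le \sqrt{\frac{1}{2}\Big(\frac{b}{\nu} + 1\Big)} = \frac{1}{\sqrt{2\nu}}\sqrt{b + \nu}.$$

Moreover, for any $h \in \mathcal{H}$, we have $h(x) \ge 1/\sqrt{2}$ for all $x \in \mathcal{X}$, whence the mean value 
theorem applied to the logarithm, combined with the triangle inequality, implies that 

$$|\mathcal{L}_h(x) - \mathcal{L}_{\tilde{h}}(x)| \le \sqrt{2}\,|h(x) - \tilde{h}(x)| \quad \text{for all } x \in \mathcal{X} \text{ and } h, \tilde{h} \in \mathcal{H},$$

showing that the logarithmic cost function is L-Lipschitz with $L = \sqrt{2}$. Finally, by 
construction, for any $h \in \mathcal{H}$ and with $h^* := \frac{f^* + f^*}{2f^*} = 1$, we have 

$$\|h - h^*\|_2^2 = \mathbb{E}_{f^*}\Big[\Big(\Big\{\frac{f + f^*}{2f^*}\Big\}^{1/2} - 1\Big)^2\Big] = 2H^2\Big(\frac{f + f^*}{2} \,\Big\|\, f^*\Big).$$

Therefore, the lower bound (14.57b) on the squared Hellinger distance in terms of the KL 
divergence is equivalent to asserting that $P(\mathcal{L}_h - \mathcal{L}_{h^*}) \ge \|h - h^*\|_2^2$, meaning that the cost 
function is 2-strongly convex around $h^*$. Consequently, the claim (14.59) follows via an 
application of Theorem 14.20(b). 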
14.4.2 Density estimation via projections 


Another very simple method for density estimation is via projection onto a function class 
$\mathcal{F}$. Concretely, again given n samples $\{x_i\}_{i=1}^n$, assumed to have been drawn from an unknown 
density $f^*$ on a space $\mathcal{X}$, consider the projection-based estimator 

$$\hat{f} \in \arg\min_{f \in \mathcal{F}} \Big\{\frac{1}{2}\|f\|_2^2 - P_n(f)\Big\} = \arg\min_{f \in \mathcal{F}} \Big\{\frac{1}{2}\|f\|_2^2 - \frac{1}{n}\sum_{i=1}^n f(x_i)\Big\}. \qquad (14.60)$$

For many choices of the underlying function class $\mathcal{F}$, this estimator can be computed in 
closed form. Let us consider some examples to illustrate. 


Example 14.23 (Density estimation via series expansion) This is a follow-up on 
Example 13.14, where we considered the use of series expansion for regression. Here we consider 
the use of such expansions for density estimation, say, for concreteness, of univariate 
densities supported on [0, 1]. For a given integer $T \ge 1$, consider a collection of functions $\{\phi_m\}_{m=1}^T$, 
taken to be orthogonal in $L^2[0, 1]$, and consider the linear function class 

$$\mathcal{F}_{\text{ortho}}(T) := \Big\{f = \sum_{m=1}^T \beta_m \phi_m \;\Big|\; \beta \in \mathbb{R}^T,\ \beta_1 = 1\Big\}. \qquad (14.61)$$

As one concrete example, we might define the indicator functions 

$$\phi_m(x) = \begin{cases} 1 & \text{if } x \in (m - 1, m]/T, \\ 0 & \text{otherwise.} \end{cases} \qquad (14.62)$$

With this choice, an expansion of the form $f = \sum_{m=1}^T \beta_m \phi_m$ yields a piecewise constant 
function that is non-negative and integrates to 1. When used for density estimation, it is 
known as a histogram estimate, and is perhaps the simplest type of density estimate. 

Another example is given by truncating the Fourier basis previously described in 
Example 13.15. In this case, since the first function $\phi_1(x) = 1$ for all $x \in [0, 1]$ and the remaining 
functions are orthogonal, we are guaranteed that the function expansion integrates to one. 
The resulting density estimate is known as a projected Fourier-series estimate. A minor point 
is that, since the sinusoidal functions are not non-negative, it is possible that the projected 
Fourier-series density estimate could take negative values; this concern could be alleviated 
by projecting the function values back onto the orthant. 

For the function class $\mathcal{F}_{\text{ortho}}(T)$, the density estimate (14.60) is straightforward to 
compute: some calculation shows that 

$$\hat{f}_T = \sum_{m=1}^T \hat{\beta}_m \phi_m, \quad \text{where } \hat{\beta}_m = \frac{1}{n}\sum_{i=1}^n \phi_m(x_i). \qquad (14.63)$$

For example, when using the histogram basis (14.62), the coefficient $\hat{\beta}_m$ corresponds to the 
fraction of samples that fall into the interval $(m - 1, m]/T$. When using a Fourier basis 
expansion, the estimate $\hat{\beta}_m$ corresponds to an empirical Fourier-series coefficient. In either 
case, the estimate $\hat{f}_T$ is easy to compute. 
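
A minimal sketch of the histogram version of (14.63) follows. One assumption goes beyond the text: the indicators are rescaled to $\phi_m = \sqrt{T}\,\mathbb{I}[x \in (m-1, m]/T]$ so that they are orthonormal in $L^2[0, 1]$, which makes the resulting estimate integrate to one. The function name `histogram_density` and the sampling set-up are illustrative.

```python
import numpy as np

def histogram_density(samples, num_bins):
    """Projection estimate (14.63) with the orthonormal histogram basis on [0, 1].

    Returns the T coefficients beta_hat and the height of f_hat on each bin.
    """
    samples = np.asarray(samples, dtype=float)
    # Bin index m - 1 for x in ((m - 1)/T, m/T]; the point x = 0 goes to the first bin.
    idx = np.clip(np.ceil(samples * num_bins).astype(int) - 1, 0, num_bins - 1)
    fractions = np.bincount(idx, minlength=num_bins) / len(samples)
    scale = np.sqrt(num_bins)      # phi_m = sqrt(T) * indicator, so ||phi_m||_2 = 1
    beta_hat = scale * fractions   # (1/n) sum_i phi_m(x_i)
    heights = scale * beta_hat     # f_hat equals beta_hat_m * sqrt(T) on bin m
    return beta_hat, heights

rng = np.random.default_rng(0)
samples = rng.normal(0.5, 0.15, size=2000)
samples = samples[(samples >= 0.0) & (samples <= 1.0)]   # keep points in [0, 1]
beta_hat, heights = histogram_density(samples, num_bins=20)
```

Each height is T times the fraction of samples in the corresponding bin, so the piecewise constant estimate has total area one; with the unnormalized indicators of (14.62) the coefficients are the bin fractions themselves.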

Figure 14.1 shows plots of histogram estimates of a Gaussian density $N(1/2, (0.15)^2)$, 
with the plots in Figure 14.1(a) and (b) corresponding to sample sizes n = 100 and n = 2000, 
respectively. In addition to the true density in light gray, each plot shows the histogram 
estimate for $T \in \{5, 20\}$. By construction, each histogram estimate is piecewise constant, 
and the parameter T determines the length of the pieces, and hence how quickly the estimate 
varies. For sample size n = 100, the estimate with T = 20 illustrates the phenomenon of 
overfitting, whereas for n = 2000, the estimate with T = 5 leads to oversmoothing. 


[Figure 14.1, panels (a) and (b): "Density estimation via histograms"; horizontal axis: x value; curves: true density, T = 5, T = 20.] 


Figure 14.1 Plots of the behavior of the histogram density estimate. Each plot shows 
the true function (in this case, a Gaussian distribution $N(1/2, (0.15)^2)$) in light gray 
and two density estimates using T = 5 bins (solid line) and T = 20 bins (dashed 
line). (a) Estimates based on n = 100 samples. (b) Estimates based on n = 2000 
samples. 


Figure 14.2 shows some plots of the Fourier-series estimator for estimating the density

$$f^*(x) = \begin{cases} 3/2 & \text{for } x \in [0, 1/2], \\ 1/2 & \text{for } x \in (1/2, 1]. \end{cases} \qquad (14.64)$$

As in Figure 14.1, the plots in Figure 14.2(a) and (b) are for sample sizes n = 100 and n = 
2000, respectively, with the true density f* shown in a gray line. The solid and dashed lines 
show the truncated Fourier-series estimator with T = 5 and T = 20 coefficients, respectively. 
Again, we see overfitting by the estimator with T = 20 coefficients when the sample size is 
small (n = 100). For the larger sample size (n = 2000), the estimator with T = 20 is more 
accurate than the T = 5 estimator, which suffers from oversmoothing. ♣


Having considered some examples of the density estimate (14.60), let us now state a
theoretical guarantee on its behavior. As with our earlier results, this guarantee applies to the
estimate based on a star-shaped class of densities $\mathcal{F}$, which we assume to be uniformly
bounded by some $b$. Recalling that $\mathcal{R}_n$ denotes the (localized) Rademacher complexity, we
let $\delta_n > 0$ be any positive solution to the inequality $\mathcal{R}_n(\delta; \mathcal{F}) \le \delta^2 / b$.

[Figure 14.2: panels (a) and (b), each titled "Density estimation via projection", showing the true density and the two Fourier-series estimates.]

Figure 14.2 Plots of the behavior of the orthogonal series density estimate (14.63)
using Fourier series as the orthonormal basis. Each plot shows the true function $f^*$
from equation (14.64) in light gray, and two density estimates for T = 5 (solid line)
and T = 20 (dashed line). (a) Estimates based on n = 100 samples. (b) Estimates
based on n = 2000 samples.


Corollary 14.24 There are universal constants $c_j$, $j = 0, 1, 2, 3$, such that for any
density $f^*$ uniformly bounded by $b$, the density estimate (14.60) satisfies the oracle
inequality

$$\|\hat{f} - f^*\|_2^2 \le c_0 \inf_{f \in \mathcal{F}} \|f - f^*\|_2^2 + c_1 \delta_n^2 \qquad (14.65)$$

with probability at least $1 - c_2 e^{-c_3 n \delta_n^2}$.


The proof of this result is very similar to our oracle inequality for nonparametric regression 
(Theorem 13.13). Accordingly, we leave the details as an exercise for the reader. 


14.5 Appendix: Population and empirical Rademacher complexities


Let $\delta_n > 0$ and $\hat{\delta}_n > 0$ be the smallest positive solutions to the inequalities
$\bar{\mathcal{R}}_n(\delta_n) \le \delta_n^2$ and $\hat{\mathcal{R}}_n(\hat{\delta}_n) \le \hat{\delta}_n^2$, respectively. Note that these inequalities correspond
to our previous definitions (14.4) and (14.7), with $b = 1$. (The general case $b \ne 1$ can be
recovered by a rescaling argument.) In this appendix, we show that these quantities satisfy
a useful sandwich relation:


Proposition 14.25 For any 1-bounded and star-shaped function class $\mathcal{F}$, the
population and empirical radii satisfy the sandwich relation

$$\frac{\delta_n}{4} \overset{(i)}{\le} \hat{\delta}_n \overset{(ii)}{\le} 3\delta_n, \qquad (14.66)$$

with probability at least $1 - c_1 e^{-c_2 n \delta_n^2}$.


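Proof For each $t > 0$, let us define the random variable

$$\bar{Z}_n(t) := \mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F},\ \|f\|_2 \le t} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big],$$

so that $\bar{\mathcal{R}}_n(t) = \mathbb{E}_x[\bar{Z}_n(t)]$ by construction. Define the events

$$\mathcal{E}_0(t) := \Big\{ \big|\bar{Z}_n(t) - \bar{\mathcal{R}}_n(t)\big| \le \frac{\delta_n t}{8} \Big\} \quad \text{and} \quad \mathcal{E}_1 := \Big\{ \sup_{f \in \mathcal{F}} \frac{\big|\|f\|_n^2 - \|f\|_2^2\big|}{\|f\|_2^2 + \delta_n^2} \le \frac{1}{2} \Big\}.$$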

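Note that, conditioned on $\mathcal{E}_1$, we have

$$\|f\|_n \le \sqrt{\tfrac{3}{2}\|f\|_2^2 + \tfrac{1}{2}\delta_n^2} \le 2\|f\|_2 + \delta_n, \qquad (14.67a)$$
$$\|f\|_2 \le \sqrt{2\|f\|_n^2 + \delta_n^2} \le 2\|f\|_n + \delta_n, \qquad (14.67b)$$

where both inequalities hold for all $f \in \mathcal{F}$. Consequently, conditioned on $\mathcal{E}_1$, we have

$$\bar{Z}_n(t) \le \mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F},\ \|f\|_n \le 2t + \delta_n} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big] = \hat{\mathcal{R}}_n(2t + \delta_n) \qquad (14.68a)$$

and

$$\hat{\mathcal{R}}_n(t) \le \bar{Z}_n(2t + \delta_n). \qquad (14.68b)$$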


Equipped with these inequalities, we now proceed to prove our claims. 


Proof of upper bound (ii) in (14.66): Conditioned on the events $\mathcal{E}_0(7\delta_n)$ and $\mathcal{E}_1$, we have

$$\hat{\mathcal{R}}_n(3\delta_n) \overset{(i)}{\le} \bar{Z}_n(7\delta_n) \overset{(ii)}{\le} \bar{\mathcal{R}}_n(7\delta_n) + \tfrac{7}{8}\delta_n^2,$$

where step (i) follows from inequality (14.68b) with $t = 3\delta_n$, and step (ii) follows from
$\mathcal{E}_0(7\delta_n)$. Since $7\delta_n \ge \delta_n$, the argument used to establish the bound (14.19) guarantees that
$\bar{\mathcal{R}}_n(7\delta_n) \le 7\delta_n^2$. Putting together the pieces, we have proved that

$$\hat{\mathcal{R}}_n(3\delta_n) \le 8\delta_n^2 < (3\delta_n)^2.$$

By definition, the quantity $\hat{\delta}_n$ is the smallest positive number satisfying this inequality, so
that we conclude that $\hat{\delta}_n \le 3\delta_n$, as claimed.


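Proof of lower bound (i) in (14.66): Conditioning on the events $\mathcal{E}_0(\delta_n)$ and $\mathcal{E}_1$, we have

$$\delta_n^2 = \bar{\mathcal{R}}_n(\delta_n) \overset{(i)}{\le} \bar{Z}_n(\delta_n) + \tfrac{1}{8}\delta_n^2 \overset{(ii)}{\le} \hat{\mathcal{R}}_n(3\delta_n) + \tfrac{1}{8}\delta_n^2 \overset{(iii)}{\le} 3\delta_n\hat{\delta}_n + \tfrac{1}{8}\delta_n^2,$$

where step (i) follows from $\mathcal{E}_0(\delta_n)$, step (ii) follows from inequality (14.68a) with $t = \delta_n$, and step
(iii) follows from the same argument leading to equation (14.19). Rearranging yields that
$\tfrac{7}{8}\delta_n^2 \le 3\delta_n\hat{\delta}_n$, which implies that $\hat{\delta}_n \ge \delta_n/4$.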


Bounding the probabilities of $\mathcal{E}_0(t)$ and $\mathcal{E}_1$: On one hand, Theorem 14.1 implies that
$\mathbb{P}[\mathcal{E}_1^c] \le c_1 e^{-c_2 n \delta_n^2}$.

On the other hand, we need to bound the probability $\mathbb{P}[\mathcal{E}_0^c(\alpha\delta_n)]$ for an arbitrary constant $\alpha \ge 1$.
In particular, our proof requires control for the choices $\alpha = 1$ and $\alpha = 7$. From theorem 16
of Bousquet et al. (2003), we have


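$$\mathbb{P}[\mathcal{E}_0^c(\alpha\delta_n)] = \mathbb{P}\Big[\big|\bar{Z}_n(\alpha\delta_n) - \bar{\mathcal{R}}_n(\alpha\delta_n)\big| \ge \frac{\alpha\delta_n^2}{8}\Big] \le 2\exp\Big(-\frac{1}{64}\,\frac{n\alpha\delta_n^4}{2\bar{\mathcal{R}}_n(\alpha\delta_n) + \frac{\alpha\delta_n^2}{12}}\Big).$$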


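For any $\alpha \ge 1$, we have $\bar{\mathcal{R}}_n(\alpha\delta_n) \ge \bar{\mathcal{R}}_n(\delta_n) = \delta_n^2$, whence $\mathbb{P}[\mathcal{E}_0^c(\alpha\delta_n)] \le 2e^{-c_2 n \delta_n^2}$.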


14.6 Bibliographic details and background


The localized forms of the Rademacher and Gaussian complexities used in this chapter
are standard objects in mathematical statistics (Koltchinskii, 2001, 2006; Bartlett et al., 
2005). Localized entropy integrals, such as the one underlying Corollary 14.3, were in- 
troduced by van de Geer (2000). The two-sided results given in Section 14.1 are based on 
b-uniform boundedness conditions on the functions. This assumption, common in much 
of non-asymptotic empirical process theory, allows for the use of standard concentration 
inequalities for empirical processes (e.g., Theorem 3.27) and the Ledoux—Talagrand con- 
traction inequality (5.61). For certain classes of unbounded functions, two-sided bounds can 
also be obtained based on sub-Gaussian and/or sub-exponential tail conditions; for instance, 
see the papers (Mendelson et al., 2007; Adamczak, 2008; Adamczak et al., 2010; Mendel- 
son, 2010) for results of this type. One-sided uniform laws related to Theorem 14.12 have 
been proved by various authors (Raskutti et al., 2012; Oliveira, 2013; Mendelson, 2015). 
The proof given here is based on a truncation argument. 

Results on the localized Rademacher complexities, as stated in Corollary 14.5, can be
found in Mendelson (2002). The class of additive regression models from Example 14.11
was introduced by Stone (1985), and has been studied in great depth (e.g., Hastie and
Tibshirani, 1986; Buja et al., 1989). An interesting extension is the class of sparse additive
models, in which the function f is restricted to have a decomposition using at most $s \ll d$
univariate functions; such models have been the focus of more recent study (e.g., Meier
et al., 2009; Ravikumar et al., 2009; Koltchinskii and Yuan, 2010; Raskutti et al., 2012).

The support vector machine from Example 14.19 is a popular method for classification 
introduced by Boser et al. (1992); see the book by Steinwart and Christmann (2008) for fur- 
ther details. The problem of density estimation treated briefly in Section 14.4 has been the 
subject of intensive study; we refer the reader to the books (Devroye and Györfi, 1986; Sil- 
verman, 1986; Scott, 1992; Eggermont and LaRiccia, 2001) and references therein for more 
details. Good and Gaskins (1971) proposed a roughness-penalized form of the nonparamet- 
ric maximum likelihood estimate; see Geman and Hwang (1982) and Silverman (1982) for 
analysis of this and some related estimators. We analyzed the constrained form of the non- 
parametric MLE under the simplifying assumption that the true density f* belongs to the 
density class F. In practice, this assumption may not be satisfied, and there would be an 
additional form of approximation error in the analysis, as in the oracle inequalities discussed 
in Chapter 13. 


14.7 Exercises 


Exercise 14.1 (Bounding the Lipschitz constant) In the setting of Proposition 14.25, show
that $\mathbb{E}\big[\sup_{f \in \mathcal{F},\ \|f\|_2 \le t} \|f\|_n\big] \le \sqrt{5}\, t$ for all $t \ge \delta_n$.


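Exercise 14.2 (Properties of local Rademacher complexity) Recall the localized
Rademacher complexity

$$\bar{\mathcal{R}}_n(\delta) = \mathbb{E}_{x,\varepsilon}\Big[\sup_{f \in \mathcal{F},\ \|f\|_2 \le \delta} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big],$$

and let $\delta_n$ be the smallest positive solution to the inequality $\bar{\mathcal{R}}_n(\delta) \le \delta^2$. Assume that the function
class $\mathcal{F}$ is star-shaped around the origin (so that $f \in \mathcal{F}$ implies $\alpha f \in \mathcal{F}$ for all $\alpha \in [0, 1]$).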


(a) Show that $\bar{\mathcal{R}}_n(s) \le \max\{\delta_n^2, s\delta_n\}$. (Hint: Lemma 13.6 could be useful.)
(b) For some constant $C \ge 1$, let $t_n > 0$ be the smallest positive solution to the inequality
$\bar{\mathcal{R}}_n(t) \le C t^2$. Show that $t_n \le \delta_n / \sqrt{C}$.


Exercise 14.3 (Sharper rates via entropy integrals) In the setting of Example 14.2, show 
that there is a universal constant c’ such that 


I 
El D p av] se ot 
ae 


Exercise 14.4 (Uniform laws for kernel classes) In this exercise, we work through the 
proof of the bound (14.14a) from Corollary 14.5. 


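(a) Letting $(\phi_j)_{j=1}^\infty$ be the eigenfunctions of the kernel operator, show that

$$\sup_{\|f\|_{\mathbb{H}} \le 1,\ \|f\|_2 \le \delta} \Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big| = \sup_{\theta \in \mathcal{D}} \Big|\sum_{j=1}^\infty \theta_j z_j\Big|,$$

where $z_j := \sum_{i=1}^n \varepsilon_i \phi_j(x_i)$ and

$$\mathcal{D} := \Big\{ (\theta_j)_{j=1}^\infty \ \Big|\ \sum_{j=1}^\infty \theta_j^2 \le \delta^2, \ \sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j} \le 1 \Big\}.$$

(b) Defining the sequence $\eta_j = \min\{\delta^2, \mu_j\}$ for $j = 1, 2, \ldots$, show that $\mathcal{D}$ is contained within
the ellipse $\mathcal{E} := \big\{ (\theta_j)_{j=1}^\infty \mid \sum_{j=1}^\infty \theta_j^2 / \eta_j \le 2 \big\}$.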
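(c) Use parts (a) and (b) to show that

$$\mathbb{E}_{\varepsilon, x}\Big[\sup_{\|f\|_{\mathbb{H}} \le 1,\ \|f\|_2 \le \delta} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big] \le \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^\infty \min\{\delta^2, \mu_j\}}.$$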


Exercise 14.5 (Empirical approximations of kernel integral operators) Let $\mathcal{K}$ be a PSD
kernel function satisfying the conditions of Mercer's theorem (Theorem 12.20), and define
the associated representer $R_x(\cdot) = \mathcal{K}(\cdot, x)$. Letting $\mathbb{H}$ be the associated reproducing kernel
Hilbert space, consider the integral operator $T_{\mathcal{K}}$ as defined in equation (12.11a).


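(a) Letting $\{x_i\}_{i=1}^n$ denote i.i.d. samples from $\mathbb{P}$, define the random linear operator
$\hat{T}_{\mathcal{K}} : \mathbb{H} \to \mathbb{H}$ via

$$f \mapsto \hat{T}_{\mathcal{K}}(f) := \frac{1}{n}\sum_{i=1}^n \big[R_{x_i} \otimes R_{x_i}\big](f) = \frac{1}{n}\sum_{i=1}^n f(x_i)\, R_{x_i}.$$

Show that $\mathbb{E}[\hat{T}_{\mathcal{K}}] = T_{\mathcal{K}}$.
(b) Use techniques from this chapter to bound the operator norm

$$|||\hat{T}_{\mathcal{K}} - T_{\mathcal{K}}|||_{\mathbb{H}} := \sup_{\|f\|_{\mathbb{H}} \le 1} \|(\hat{T}_{\mathcal{K}} - T_{\mathcal{K}})(f)\|_{\mathbb{H}}.$$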


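(c) Letting $\phi_j$ denote the jth eigenfunction of $T_{\mathcal{K}}$, with associated eigenvalue $\mu_j > 0$, show
that

$$\|\hat{T}_{\mathcal{K}}(\phi_j) - \mu_j \phi_j\|_{\mathbb{H}} \le \frac{|||\hat{T}_{\mathcal{K}} - T_{\mathcal{K}}|||_{\mathbb{H}}}{\sqrt{\mu_j}}.$$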


Exercise 14.6 (Linear functions and four-way independence) Recall the class $\mathcal{F}_{\mathrm{lin}}$ of linear
functions from Example 14.10. Consider a random vector $x \in \mathbb{R}^d$ with four-way independent
components, i.e., the variables $(x_j, x_k, x_\ell, x_m)$ are independent for all distinct quadruples of
indices. Assume, moreover, that each component has mean zero and variance one, and that
$\mathbb{E}[x_j^4] \le B$. Show that the strong moment condition (14.22b) is satisfied with $C = B + 6$.


Exercise 14.7 (Uniform laws and sparse eigenvalues) In this exercise, we explore the use
of Theorem 14.12 for bounding sparse restricted eigenvalues (see Chapter 7). Let $X \in \mathbb{R}^{n \times d}$
be a random matrix with i.i.d. $N(0, \Sigma)$ rows. For a given parameter $s > 0$, define the function
class $\mathcal{F}_{\mathrm{spcone}} = \{f_\theta \mid \|\theta\|_1 \le \sqrt{s}\,\|\theta\|_2\}$, where $f_\theta(x) = \langle x, \theta \rangle$. Letting $\rho^2(\Sigma)$ denote the maximal
diagonal entry of $\Sigma$, show that, as long as
for a sufficiently large constant c, then we are guaranteed that


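$$\underbrace{\|f_\theta\|_n^2}_{\|X\theta\|_2^2 / n} \ge \frac{1}{2} \underbrace{\|f_\theta\|_2^2}_{\|\sqrt{\Sigma}\,\theta\|_2^2} \qquad \text{for all } f_\theta \in \mathcal{F}_{\mathrm{spcone}}$$

with probability at least $1 - e^{-c_1 n}$. Thus, we have proved a somewhat sharper version of
Theorem 7.16. (Hint: Exercise 7.15 could be useful to you.)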


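Exercise 14.8 (Estimation of nonparametric additive models) Recall from Example 14.11
the class $\mathcal{F}_{\mathrm{add}}$ of additive models formed by some base class $\mathcal{G}$ that is convex and
1-uniformly bounded ($\|g\|_\infty \le 1$ for all $g \in \mathcal{G}$). Let $\delta_n$ be the smallest positive solution to
the inequality $\mathcal{R}_n(\delta; \mathcal{F}_{\mathrm{add}}) \le \delta^2$. Letting $\epsilon_n$ be the smallest positive solution to the inequality
$\mathcal{R}_n(\epsilon; \mathcal{G}) \le \epsilon^2$, show that $\delta_n^2 \lesssim d\, \epsilon_n^2$.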


Exercise 14.9 (Nonparametric maximum likelihood) Consider the nonparametric density 
estimate (14.56) over the class of all differentiable densities. Show that the minimum is 
not achieved. (Hint: Consider a sequence of differentiable approximations to the density 
function placing mass 1/n at each of the data points.)


Exercise 14.10 (Hellinger distance and Kullback—Leibler divergence) Prove the lower 
bound (14.57b) on the Kullback—Leibler divergence in terms of the squared Hellinger dis- 
tance. 


Exercise 14.11 (Bounds on histogram density estimation) Recall the histogram estimator
defined by the basis (14.62), and suppose that we apply it to estimate a density $f^*$ on the
unit interval [0, 1] that is differentiable with $\|(f^*)'\|_\infty \le 1$. Use the oracle inequality from
Corollary 14.24 to show that there is a universal constant c such that

$$\|\hat{f} - f^*\|_2^2 \le c\, n^{-2/3} \qquad (14.69)$$

with high probability.


15 


Minimax lower bounds 


In the preceding chapters, we have derived a number of results on the convergence rates 
of different estimation procedures. In this chapter, we turn to the complementary question: 
Can we obtain matching lower bounds on estimation rates? This question can be asked both 
in the context of a specific procedure or algorithm, and in an algorithm-independent sense. 
We focus on the latter question in this chapter. In particular, our goal is to derive lower 
bounds on the estimation error achievable by any procedure, regardless of its computational 
complexity and/or storage. 

Lower bounds of this type can yield two different but complementary types of insight. 
A first possibility is that they can establish that known—and possibly polynomial-time— 
estimators are statistically “optimal”, meaning that they have estimation error guarantees that 
match the lower bounds. In this case, there is little purpose in searching for estimators with 
lower statistical error, although it might still be interesting to study optimal estimators that 
enjoy lower computational and/or storage costs, or have other desirable properties such as 
robustness. A second possibility is that the lower bounds do not match the best known upper 
bounds. In this case, assuming that the lower bounds are tight, one has a strong motivation 
to study alternative estimators. 

In this chapter, we develop various techniques for establishing such lower bounds. Of 
particular relevance to our development are the properties of packing sets and metric entropy, 
as discussed in Chapter 5. In addition, we require some basic aspects of information theory, 
including entropy and the Kullback—Leibler divergence, as well as other types of divergences 
between probability measures, which we provide in this chapter. 


15.1 Basic framework 


Given a class of distributions $\mathcal{P}$, we let $\theta$ denote a functional on the space $\mathcal{P}$, that is, a
mapping from a distribution $\mathbb{P}$ to a parameter $\theta(\mathbb{P})$ taking values in some space $\Omega$. Our goal
is to estimate $\theta(\mathbb{P})$ based on samples drawn from the unknown distribution $\mathbb{P}$.

In certain cases, the quantity $\theta(\mathbb{P})$ uniquely determines the underlying distribution $\mathbb{P}$,
meaning that $\theta(\mathbb{P}_0) = \theta(\mathbb{P}_1)$ if and only if $\mathbb{P}_0 = \mathbb{P}_1$. In such cases, we can think of $\theta$ as
providing a parameterization of the family of distributions. Such classes include most of
the usual finite-dimensional parametric classes, as well as certain nonparametric problems,
among them nonparametric regression problems. For such classes, we can write
$\mathcal{P} = \{\mathbb{P}_\theta \mid \theta \in \Omega\}$, as we have done in previous chapters.

In other settings, however, we might be interested in estimating a functional $\mathbb{P} \mapsto \theta(\mathbb{P})$
that does not uniquely specify the distribution. For instance, given a class of distributions $\mathcal{P}$
on the unit interval [0, 1] with differentiable density functions f, we might be interested in
estimating the quadratic functional $\mathbb{P} \mapsto \theta(\mathbb{P}) = \int_0^1 (f'(t))^2 \, dt \in \mathbb{R}$. Alternatively, for a class
of unimodal density functions f on the unit interval [0, 1], we might be interested in
estimating the mode of the density $\theta(\mathbb{P}) = \arg\max_{x \in [0,1]} f(x)$. Thus, the viewpoint of estimating
functionals adopted here is considerably more general than a parameterized family of
distributions.


15.1.1 Minimax risks 


Suppose that we are given a random variable $X$ drawn according to a distribution $\mathbb{P}$ for which
$\theta(\mathbb{P}) = \theta^*$. Our goal is to estimate the unknown quantity $\theta^*$ on the basis of the data $X$. An
estimator $\hat{\theta}$ for doing so can be viewed as a measurable function from the domain $\mathcal{X}$ of the
random variable $X$ to the parameter space $\Omega$. In order to assess the quality of any estimator,
we let $\rho : \Omega \times \Omega \to [0, \infty)$ be a semi-metric,¹ and we consider the quantity $\rho(\hat{\theta}, \theta^*)$. Here the
quantity $\theta^*$ is fixed but unknown, whereas the quantity $\hat{\theta} \equiv \hat{\theta}(X)$ is a random variable, so that
$\rho(\hat{\theta}, \theta^*)$ is random. By taking expectations over the observable $X$, we obtain the deterministic
quantity $\mathbb{E}_{\mathbb{P}}[\rho(\hat{\theta}, \theta^*)]$. As the parameter $\theta^*$ is varied, we obtain a function, typically referred
to as the risk function, associated with the estimator.

The first property to note is that it makes no sense to consider the set of estimators that
are good in a pointwise sense. For any fixed $\theta^*$, there is always a very good way in which
to estimate it: simply ignore the data, and return $\theta^*$. The resulting deterministic estimator
has zero risk when evaluated at the fixed $\theta^*$, but of course is likely to behave very poorly
for other choices of the parameter. There are various ways in which to circumvent this and
related difficulties. The Bayesian approach is to view the unknown parameter $\theta^*$ as a random
variable; when endowed with some prior distribution, we can then take expectations over the
risk function with respect to this prior. A closely related approach is to model the choice of $\theta^*$
in an adversarial manner, and to compare estimators based on their worst-case performance.
More precisely, for each estimator $\hat{\theta}$, we compute the worst-case risk $\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\rho(\hat{\theta}, \theta(\mathbb{P}))]$,
and rank estimators according to this ordering. The estimator that is optimal in this sense
defines a quantity known as the minimax risk, namely


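$$\mathfrak{M}(\theta(\mathcal{P}); \rho) := \inf_{\hat{\theta}} \sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\rho(\hat{\theta}, \theta(\mathbb{P}))\big], \qquad (15.1)$$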


where the infimum ranges over all possible estimators, by which we mean measurable
functions of the data. When the estimator is based on n i.i.d. samples from $\mathbb{P}$, we use $\mathfrak{M}_n$ to
denote the associated minimax risk.

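We are often interested in evaluating minimax risks defined not by a norm, but rather by
a squared norm. This extension is easily accommodated by letting $\Phi : [0, \infty) \to [0, \infty)$ be an
increasing function on the non-negative real line, and then defining a slight generalization
of the $\rho$-minimax risk, namely

$$\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) := \inf_{\hat{\theta}} \sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\Phi\big(\rho(\hat{\theta}, \theta(\mathbb{P}))\big)\big]. \qquad (15.2)$$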


¹ In our usage, a semi-metric satisfies all properties of a metric, except that there may exist pairs $\theta \ne \theta'$ for
which $\rho(\theta, \theta') = 0$.

A particularly common choice is $\Phi(t) = t^2$, which can be used to obtain minimax risks for
the mean-squared error associated with $\rho$.
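
To make the contrast between pointwise and worst-case risk concrete, here is a minimal sketch under the squared error, i.e. $\Phi(t) = t^2$ with $\rho$ the absolute difference. It assumes a Gaussian location model with n i.i.d. $N(\theta, \sigma^2)$ samples and uses the standard closed-form risks; the grid of parameter values and all function names are hypothetical choices for illustration, and a finite grid only approximates the supremum in (15.2).

```python
def risk_sample_mean(theta, sigma, n):
    """Squared-error risk of the sample mean of n i.i.d. N(theta, sigma^2) draws."""
    return sigma ** 2 / n  # does not depend on theta


def risk_constant(theta, c):
    """Squared-error risk of the estimator that ignores the data and returns c."""
    return (c - theta) ** 2


def worst_case(risk, thetas):
    """Maximum of a risk function over a finite grid of parameter values."""
    return max(risk(t) for t in thetas)


sigma, n = 1.0, 25
thetas = [k / 10 for k in range(-30, 31)]  # hypothetical grid on [-3, 3]

# The constant estimator is perfect at theta* = 0 ...
pointwise_at_zero = risk_constant(0.0, c=0.0)
# ... but its worst-case risk over the grid is large,
worst_constant = worst_case(lambda t: risk_constant(t, c=0.0), thetas)
# whereas the sample mean has the same small risk everywhere.
worst_mean = worst_case(lambda t: risk_sample_mean(t, sigma, n), thetas)
```

Over the whole real line the worst-case risk of the constant estimator is infinite, which is why ranking estimators by worst-case risk rules it out.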


15.1.2 From estimation to testing 


With this set-up, we now turn to the primary goal of this chapter: developing methods for 
lower bounding the minimax risk. Our first step is to show how lower bounds can be obtained 
via “reduction” to the problem of obtaining lower bounds for the probability of error in a 
certain testing problem. We do so by constructing a suitable packing of the parameter space 
(see Chapter 5 for background on packing numbers and metric entropy). 

More precisely, suppose that $\{\theta^1, \ldots, \theta^M\}$ is a $2\delta$-separated set² contained in the space
$\theta(\mathcal{P})$, meaning a collection of elements such that $\rho(\theta^j, \theta^k) \ge 2\delta$ for all $j \ne k$. For each $\theta^j$, let us
choose some representative distribution $\mathbb{P}_{\theta^j}$ (that is, a distribution such that $\theta(\mathbb{P}_{\theta^j}) = \theta^j$),
and then consider the M-ary hypothesis testing problem defined by the family of
distributions $\{\mathbb{P}_{\theta^j}, \ j = 1, \ldots, M\}$. In particular, we generate a random variable Z by the following
procedure:


(1) Sample a random integer J from the uniform distribution over the index set [M] := 
{1,...,M}. 
(2) Given J = j, sample $Z \sim \mathbb{P}_{\theta^j}$.


We let $\mathbb{Q}$ denote the joint distribution of the pair (Z, J) generated by this procedure. Note
that the marginal distribution over Z is given by the uniformly weighted mixture distribution
$\bar{\mathbb{Q}} := \frac{1}{M} \sum_{j=1}^M \mathbb{P}_{\theta^j}$. Given a sample Z from this mixture distribution, we consider the M-ary
hypothesis testing problem of determining the randomly chosen index J. A testing function
for this problem is a mapping $\psi : \mathcal{Z} \to [M]$, and the associated probability of error is given
by $\mathbb{Q}[\psi(Z) \ne J]$, where the probability is taken jointly over the pair (Z, J). This error
probability may be used to obtain a lower bound on the minimax risk as follows:


Proposition 15.1 (From estimation to testing) For any increasing function $\Phi$ and
choice of $2\delta$-separated set, the minimax risk is lower bounded as

$$\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \ge \Phi(\delta) \inf_{\psi} \mathbb{Q}[\psi(Z) \ne J], \qquad (15.3)$$

where the infimum ranges over test functions.
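
For small discrete families, the testing error on the right-hand side of (15.3) can be computed exactly. The sketch below does so for M distributions on a finite alphabet; the three distributions, the zero-based indexing (indices 0, ..., M-1 rather than 1, ..., M) and the function names are assumptions made for illustration. It uses the standard fact that, under a uniform prior on J, the error is minimized by choosing the index with the largest likelihood.

```python
def mixture(dists):
    """Uniformly weighted mixture (1/M) * sum_j P_j over a finite alphabet."""
    M = len(dists)
    return [sum(p[z] for p in dists) / M for z in range(len(dists[0]))]


def error_probability(test, dists):
    """Q[psi(Z) != J] when J is uniform on {0,...,M-1} and Z | J = j ~ dists[j].

    `test` is a list with test[z] = psi(z).
    """
    M = len(dists)
    correct = sum(dists[test[z]][z] for z in range(len(test))) / M
    return 1.0 - correct


def best_test(dists):
    """Maximum-likelihood test: for each z, pick the j with the largest P_j(z)."""
    K = len(dists[0])
    return [max(range(len(dists)), key=lambda j: dists[j][z]) for z in range(K)]


# Hypothetical family of M = 3 distributions on the alphabet {0, 1, 2}.
dists = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.25, 0.25, 0.5],
]

marginal = mixture(dists)                       # distribution of Z alone
psi = best_test(dists)                          # optimal test
optimal_error = error_probability(psi, dists)   # inf over tests
naive_error = error_probability([0, 0, 0], dists)  # always guess index 0
```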


Note that the right-hand side of the bound (15.3) involves two terms, both of which depend
on the choice of $\delta$. By assumption, the function $\Phi$ is increasing in $\delta$, so that it is maximized
by choosing $\delta$ as large as possible. On the other hand, the testing error $\mathbb{Q}[\psi(Z) \ne J]$ is
defined in terms of a collection of $2\delta$-separated distributions. As $\delta \to 0^+$, the underlying
testing problem becomes more difficult, so that, at least in general, we should expect
that $\mathbb{Q}[\psi(Z) \ne J]$ grows as $\delta$ decreases. If we choose a value $\delta^*$ sufficiently small to ensure
that this testing error is at least 1/2, then we may conclude that $\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \ge \frac{1}{2}\Phi(\delta^*)$.
For a given choice of $\delta$, the remaining degree of freedom is our choice of packing set,
and we will see a number of different constructions in the sequel.

² Here we enforce only the milder requirement $\rho(\theta^j, \theta^k) \ge 2\delta$, as opposed to the strict inequality required for a
packing set. This looser requirement turns out to be convenient in later calculations.


We now turn to the proof of the proposition. 


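Proof For any $\mathbb{P} \in \mathcal{P}$ with parameter $\theta = \theta(\mathbb{P})$, we have

$$\mathbb{E}_{\mathbb{P}}\big[\Phi(\rho(\hat{\theta}, \theta))\big] \overset{(i)}{\ge} \Phi(\delta)\, \mathbb{P}\big[\Phi(\rho(\hat{\theta}, \theta)) \ge \Phi(\delta)\big] \overset{(ii)}{\ge} \Phi(\delta)\, \mathbb{P}\big[\rho(\hat{\theta}, \theta) \ge \delta\big],$$

where step (i) follows from Markov's inequality, and step (ii) follows from the increasing
nature of $\Phi$. Thus, it suffices to lower bound the quantity

$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{P}\big[\rho(\hat{\theta}, \theta(\mathbb{P})) \ge \delta\big].$$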


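Recall that $\mathbb{Q}$ denotes the joint distribution over the pair (Z, J) defined by our construction.
Note that

$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{P}\big[\rho(\hat{\theta}, \theta(\mathbb{P})) \ge \delta\big] \ge \frac{1}{M} \sum_{j=1}^M \mathbb{P}_{\theta^j}\big[\rho(\hat{\theta}, \theta^j) \ge \delta\big] = \mathbb{Q}\big[\rho(\hat{\theta}, \theta^J) \ge \delta\big],$$

so we have reduced the problem to lower bounding the quantity $\mathbb{Q}[\rho(\hat{\theta}, \theta^J) \ge \delta]$.
Now observe that any estimator $\hat{\theta}$ can be used to define a test, namely via

$$\psi(Z) := \arg\min_{\ell \in [M]} \rho(\theta^\ell, \hat{\theta}). \qquad (15.4)$$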


(If there are multiple indices that achieve the minimizing argument, then we break such
ties in an arbitrary but well-defined way.) Suppose that the true parameter is $\theta^j$: we then
claim that the event $\{\rho(\theta^j, \hat{\theta}) < \delta\}$ ensures that the test (15.4) is correct. In order to see this
implication, note that, for any other index $k \in [M]$, an application of the triangle inequality
guarantees that

$$\rho(\theta^k, \hat{\theta}) \ge \underbrace{\rho(\theta^k, \theta^j)}_{\ge 2\delta} - \underbrace{\rho(\theta^j, \hat{\theta})}_{< \delta} > 2\delta - \delta = \delta,$$

where the lower bound $\rho(\theta^j, \theta^k) \ge 2\delta$ follows by the $2\delta$-separated nature of our set.
Consequently, we have $\rho(\theta^k, \hat{\theta}) > \rho(\theta^j, \hat{\theta})$ for all $k \ne j$, so that, by the definition (15.4) of our test,
we must have $\psi(Z) = j$. See Figure 15.1 for the geometry of this argument.

Figure 15.1 Reduction from estimation to testing using a $2\delta$-separated set in the
space $\Omega$ in the semi-metric $\rho$. If an estimator $\hat{\theta}$ satisfies the bound $\rho(\hat{\theta}, \theta^j) < \delta$
whenever the true parameter is $\theta^j$, then it can be used to determine the correct index j in
the associated testing problem.

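Therefore, conditioned on J = j, the event $\{\rho(\hat{\theta}, \theta^j) < \delta\}$ is contained within the event
$\{\psi(Z) = j\}$, which implies that $\mathbb{P}_{\theta^j}[\rho(\hat{\theta}, \theta^j) \ge \delta] \ge \mathbb{P}_{\theta^j}[\psi(Z) \ne j]$. Taking averages over the
index j, we find that

$$\mathbb{Q}\big[\rho(\hat{\theta}, \theta^J) \ge \delta\big] = \frac{1}{M} \sum_{j=1}^M \mathbb{P}_{\theta^j}\big[\rho(\hat{\theta}, \theta^j) \ge \delta\big] \ge \mathbb{Q}[\psi(Z) \ne J].$$

Combined with our earlier argument, we have shown that

$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\Phi(\rho(\hat{\theta}, \theta))\big] \ge \Phi(\delta)\, \mathbb{Q}[\psi(Z) \ne J].$$

Finally, we may take the infimum over all estimators $\hat{\theta}$ on the left-hand side, and the infimum
over the induced set of tests on the right-hand side. The full infimum over all tests can only
be smaller, from which the claim follows.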
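
The key step of the proof, turning an estimator into a test via the rule (15.4), can be sketched in a few lines. The example below assumes a scalar parameter with $\rho(a, b) = |a - b|$; the packing set, the value of $\delta$ and the function names are hypothetical, and indices are zero-based for convenience.

```python
def induced_test(theta_hat, packing, rho):
    """Rule (15.4): index of the packing element closest to theta_hat.

    Ties are broken by taking the smallest index.
    """
    return min(range(len(packing)), key=lambda j: rho(packing[j], theta_hat))


def is_separated(packing, rho, delta):
    """Check that rho(theta_j, theta_k) >= 2 * delta for all j != k."""
    return all(
        rho(packing[j], packing[k]) >= 2 * delta
        for j in range(len(packing))
        for k in range(j + 1, len(packing))
    )


def rho(a, b):
    return abs(a - b)


delta = 0.5
packing = [0.0, 1.0, 2.0, 3.5]  # hypothetical 2*delta-separated set

# An estimate within distance delta of the true element is decoded correctly.
decoded = induced_test(1.4, packing, rho)
```

Whenever the estimate lies strictly within distance $\delta$ of the true packing element, the triangle inequality argument above guarantees that the induced test returns the correct index, as the value of `decoded` illustrates for the element 1.0.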


15.1.3 Some divergence measures 


Thus far, we have established a connection between minimax risks and error probabilities 
in testing problems. Our next step is to develop techniques for lower bounding the error 
probability, for which we require some background on different types of divergence mea- 
sures between probability distributions. Three such measures of particular importance are 
the total variation (TV) distance, the Kullback—Leibler (KL) divergence and the Hellinger 
distance. 


Let $\mathbb{P}$ and $\mathbb{Q}$ be two distributions on $\mathcal{X}$ with densities p and q with respect to some
underlying base measure $\nu$. Note that there is no loss of generality in assuming the existence
of densities, since any pair of distributions have densities with respect to the base measure
$\nu = \frac{1}{2}(\mathbb{P} + \mathbb{Q})$. The total variation (TV) distance between two distributions $\mathbb{P}$ and $\mathbb{Q}$ is defined
as

$$\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}} := \sup_{A \subseteq \mathcal{X}} |\mathbb{P}(A) - \mathbb{Q}(A)|. \qquad (15.5)$$

In terms of the underlying densities, we have the equivalent definition

$$\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}} = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)| \, \nu(dx), \qquad (15.6)$$

corresponding to one-half the $L^1(\nu)$-norm between the densities. (See Exercise 3.13 from
Chapter 3 for details on this equivalence.) In the sequel, we will see how the total variation
distance is closely connected to the Bayes error in binary hypothesis testing.

A closely related measure of the "distance" between distributions is the Kullback–Leibler
divergence. When expressed in terms of the densities q and p, it takes the form

$$D(\mathbb{Q} \,\|\, \mathbb{P}) = \int_{\mathcal{X}} q(x) \log \frac{q(x)}{p(x)} \, \nu(dx), \qquad (15.7)$$

where $\nu$ is some underlying base measure defining the densities. Unlike the total variation
distance, it is not actually a metric, since, for example, it fails to be symmetric in its
arguments in general (i.e., there are pairs for which $D(\mathbb{Q} \,\|\, \mathbb{P}) \ne D(\mathbb{P} \,\|\, \mathbb{Q})$). However, it can be
used to upper bound the TV distance, as stated in the following classical result:


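Lemma 15.2 (Pinsker–Csiszár–Kullback inequality) For all distributions $\mathbb{P}$ and $\mathbb{Q}$,

$$\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2} D(\mathbb{Q} \,\|\, \mathbb{P})}. \qquad (15.8)$$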


Recall that this inequality arose in our study of the concentration of measure phenomenon 
(Chapter 3). This inequality is also useful here, but instead in the context of establishing 
minimax lower bounds. See Exercise 15.6 for an outline of the proof of this bound. 


A third distance that plays an important role in statistical problems is the squared Hellinger
distance, given by

$$H^2(\mathbb{P} \,\|\, \mathbb{Q}) := \int \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2 \, \nu(dx). \qquad (15.9)$$

It is simply the squared $L^2(\nu)$-norm between the square-root density functions, and an easy
calculation shows that it takes values in the interval [0, 2]. When the base measure is clear from the
context, we use the notation $H^2(p \,\|\, q)$ and $H^2(\mathbb{P} \,\|\, \mathbb{Q})$ interchangeably.

Like the KL divergence, the Hellinger distance can also be used to upper bound the TV 
distance: 


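Lemma 15.3 (Le Cam's inequality) For all distributions $\mathbb{P}$ and $\mathbb{Q}$,

$$\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}} \le H(\mathbb{P} \,\|\, \mathbb{Q}) \sqrt{1 - \frac{H^2(\mathbb{P} \,\|\, \mathbb{Q})}{4}}. \qquad (15.10)$$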


We work through the proof of this inequality in Exercise 15.5. 
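
As a quick numerical sanity check of the definitions (15.6), (15.7) and (15.9), and of the two upper bounds (15.8) and (15.10), the following sketch evaluates all three divergences for a pair of discrete distributions, taking the counting measure as base measure. The two probability vectors and the function names are hypothetical choices made for illustration.

```python
import math


def tv(p, q):
    """Total variation distance: half the L1 distance between the pmfs."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(q, p):
    """D(Q || P) = sum_x q(x) log(q(x) / p(x)); terms with q(x) = 0 contribute zero."""
    total = 0.0
    for a, b in zip(q, p):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total


def hellinger_sq(p, q):
    """Squared Hellinger distance: sum of squared differences of square roots."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


# Hypothetical pair of distributions on a three-point alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

tv_pq = tv(p, q)
pinsker_bound = math.sqrt(0.5 * kl(q, p))                 # right-hand side of (15.8)
h2 = hellinger_sq(p, q)
le_cam_bound = math.sqrt(h2) * math.sqrt(1.0 - h2 / 4.0)  # right-hand side of (15.10)
```

For this particular pair, both bounds exceed the TV distance, as the two lemmas require.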


Let $(\mathbb{P}_1, \ldots, \mathbb{P}_n)$ be a collection of n probability measures, each defined on $\mathcal{X}$, and let
$\mathbb{P}^{1:n} = \bigotimes_{i=1}^n \mathbb{P}_i$ be the product measure defined on $\mathcal{X}^n$. If we define another product measure
$\mathbb{Q}^{1:n}$ in a similar manner, then it is natural to ask whether the divergence between $\mathbb{P}^{1:n}$ and
$\mathbb{Q}^{1:n}$ has a "nice" expression in terms of divergences between the individual pairs.

In this context, the total variation distance behaves badly: in general, it is difficult to
express the distance $\|\mathbb{P}^{1:n} - \mathbb{Q}^{1:n}\|_{\mathrm{TV}}$ in terms of the individual distances $\|\mathbb{P}_i - \mathbb{Q}_i\|_{\mathrm{TV}}$. On the
other hand, the Kullback–Leibler divergence exhibits a very attractive decoupling property,
in that we have

$$D(\mathbb{P}^{1:n} \,\|\, \mathbb{Q}^{1:n}) = \sum_{i=1}^n D(\mathbb{P}_i \,\|\, \mathbb{Q}_i). \qquad (15.11a)$$

This property is straightforward to verify from the definition. In the special case of i.i.d.
product distributions (meaning that $\mathbb{P}_i = \mathbb{P}_1$ and $\mathbb{Q}_i = \mathbb{Q}_1$ for all i), we have

$$D(\mathbb{P}^{1:n} \,\|\, \mathbb{Q}^{1:n}) = n\, D(\mathbb{P}_1 \,\|\, \mathbb{Q}_1). \qquad (15.11b)$$
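Although the squared Hellinger distance does not decouple in quite such a simple way, it
does have the following property:

$$\tfrac{1}{2} H^2(\mathbb{P}^{1:n} \,\|\, \mathbb{Q}^{1:n}) = 1 - \prod_{i=1}^n \Big(1 - \tfrac{1}{2} H^2(\mathbb{P}_i \,\|\, \mathbb{Q}_i)\Big). \qquad (15.12a)$$

Thus, in the i.i.d. case, we have

$$\tfrac{1}{2} H^2(\mathbb{P}^{1:n} \,\|\, \mathbb{Q}^{1:n}) = 1 - \Big(1 - \tfrac{1}{2} H^2(\mathbb{P}_1 \,\|\, \mathbb{Q}_1)\Big)^n \le \tfrac{1}{2}\, n\, H^2(\mathbb{P}_1 \,\|\, \mathbb{Q}_1). \qquad (15.12b)$$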


See Exercises 15.3 and 15.7 for verifications of these and related properties, which play an 
important role in the sequel. 
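
The decoupling relations (15.11a), (15.12a) and the bound in (15.12b) can be checked numerically by enumerating a small product space. The sketch below does this for discrete marginals; the marginal distributions, the value of n and the function names are assumptions made for illustration, and `math.prod` requires Python 3.8 or later.

```python
import itertools
import math


def product_pmf(marginals):
    """Probabilities of the product measure, listed in itertools.product order."""
    supports = [range(len(m)) for m in marginals]
    return [
        math.prod(m[z] for m, z in zip(marginals, outcome))
        for outcome in itertools.product(*supports)
    ]


def kl(p, q):
    """D(P || Q) for strictly positive probability lists of equal length."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def hellinger_sq(p, q):
    """Squared Hellinger distance with the counting measure as base measure."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


# Hypothetical marginals (not identically distributed).
P = [[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]]
Q = [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
P_prod, Q_prod = product_pmf(P), product_pmf(Q)

# Both sides of (15.11a).
kl_joint = kl(P_prod, Q_prod)
kl_sum = sum(kl(p, q) for p, q in zip(P, Q))

# Both sides of (15.12a).
half_h2_joint = 0.5 * hellinger_sq(P_prod, Q_prod)
half_h2_formula = 1.0 - math.prod(
    1.0 - 0.5 * hellinger_sq(p, q) for p, q in zip(P, Q)
)

# i.i.d. case with n copies of the first pair: the inequality in (15.12b).
n = 4
iid_half_h2 = 0.5 * hellinger_sq(product_pmf([P[0]] * n), product_pmf([Q[0]] * n))
iid_upper = 0.5 * n * hellinger_sq(P[0], Q[0])
```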


15.2 Binary testing and Le Cam’s method 


The simplest type of testing problem, known as a binary hypothesis test, involves only two 
distributions. In this section, we describe the connection between binary testing and the 
total variation norm, and use it to develop various lower bounds, culminating in a general 
technique known as Le Cam’s method. 


15.2.1 Bayes error and total variation distance 


In a binary testing problem with equally weighted hypotheses, we observe a random variable
Z drawn according to the mixture distribution $\bar{\mathbb{Q}} := \frac{1}{2}\mathbb{P}_0 + \frac{1}{2}\mathbb{P}_1$. For a given decision rule
$\psi : \mathcal{Z} \to \{0, 1\}$, the associated probability of error is given by

$$\mathbb{Q}[\psi(Z) \ne J] = \tfrac{1}{2}\mathbb{P}_0[\psi(Z) \ne 0] + \tfrac{1}{2}\mathbb{P}_1[\psi(Z) \ne 1].$$

If we take the infimum of this error probability over all decision rules, we obtain a quantity
known as the Bayes risk for the problem. In the binary case, the Bayes risk can actually
be expressed explicitly in terms of the total variation distance $\|\mathbb{P}_1 - \mathbb{P}_0\|_{\mathrm{TV}}$, as previously
defined in equation (15.5); more precisely, we have

$$\inf_{\psi} \mathbb{Q}[\psi(Z) \ne J] = \tfrac{1}{2}\big\{1 - \|\mathbb{P}_1 - \mathbb{P}_0\|_{\mathrm{TV}}\big\}. \qquad (15.13)$$


Note that the worst-case value of the Bayes risk is one-half, achieved when $\mathbb{P}_1 = \mathbb{P}_0$, so that
the hypotheses are completely indistinguishable. At the other extreme, the best-case Bayes
risk is zero, achieved when $\|\mathbb{P}_1 - \mathbb{P}_0\|_{\mathrm{TV}} = 1$. This latter equality occurs, for instance, when
$\mathbb{P}_0$ and $\mathbb{P}_1$ have disjoint supports.

In order to verify the equivalence (15.13), note that there is a one-to-one correspondence
between decision rules $\psi$ and measurable partitions $(A, A^c)$ of the space $\mathcal{X}$; more precisely,
any decision rule $\psi$ is uniquely determined by the set $A = \{x \in \mathcal{X} \mid \psi(x) = 1\}$. Thus, we have

$$\sup_{\psi} \mathbb{Q}[\psi(Z) = J] = \sup_{A \subseteq \mathcal{X}} \big\{\tfrac{1}{2}\mathbb{P}_1(A) + \tfrac{1}{2}\mathbb{P}_0(A^c)\big\} = \tfrac{1}{2} \sup_{A \subseteq \mathcal{X}} \big\{\mathbb{P}_1(A) - \mathbb{P}_0(A)\big\} + \tfrac{1}{2}.$$

Since $\sup_{\psi} \mathbb{Q}[\psi(Z) = J] = 1 - \inf_{\psi} \mathbb{Q}[\psi(Z) \ne J]$, the claim (15.13) then follows from the
definition (15.5) of the total variation distance.
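
On a finite alphabet, the identity (15.13) can be verified by brute force, since every decision rule corresponds to one of finitely many subsets A. The sketch below enumerates all rules and compares the smallest error probability with the closed form; the two distributions and the function names are hypothetical choices for illustration.

```python
import itertools


def tv(p0, p1):
    """Total variation distance between two pmfs on the same finite alphabet."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p0, p1))


def error_probability(rule, p0, p1):
    """Q[psi(Z) != J] = 0.5 * P0[psi(Z) = 1] + 0.5 * P1[psi(Z) = 0]."""
    err_under_p0 = sum(p for p, r in zip(p0, rule) if r == 1)
    err_under_p1 = sum(p for p, r in zip(p1, rule) if r == 0)
    return 0.5 * err_under_p0 + 0.5 * err_under_p1


def bayes_risk_bruteforce(p0, p1):
    """Minimize the error probability over all 2^K decision rules."""
    rules = itertools.product([0, 1], repeat=len(p0))
    return min(error_probability(rule, p0, p1) for rule in rules)


# Hypothetical pair of distributions on a three-point alphabet.
p0 = [0.5, 0.3, 0.2]
p1 = [0.2, 0.3, 0.5]

brute = bayes_risk_bruteforce(p0, p1)
closed_form = 0.5 * (1.0 - tv(p0, p1))  # right-hand side of (15.13)
```

The enumeration is exponential in the alphabet size, so it is only meant as a check of the identity on tiny examples.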

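The representation (15.13), in conjunction with Proposition 15.1, provides one avenue
for deriving lower bounds. In particular, for any pair of distributions $\mathbb{P}_0, \mathbb{P}_1 \in \mathcal{P}$ such that
$\rho(\theta(\mathbb{P}_0), \theta(\mathbb{P}_1)) \ge 2\delta$, we have

$$\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \ge \frac{\Phi(\delta)}{2} \big\{1 - \|\mathbb{P}_1 - \mathbb{P}_0\|_{\mathrm{TV}}\big\}. \qquad (15.14)$$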


Let us illustrate the use of this simple lower bound with some examples. 


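Example 15.4 (Gaussian location family) For a fixed variance $\sigma^2$, let $\mathbb{P}_\theta$ be the distribution of a $N(\theta, \sigma^2)$ variable; letting the mean $\theta$ vary over the real line defines the Gaussian location family $\{\mathbb{P}_\theta, \theta \in \mathbb{R}\}$. Here we consider the problem of estimating $\theta$ under either the absolute error $|\hat{\theta} - \theta|$ or the squared error $(\hat{\theta} - \theta)^2$ using a collection $Z = (Y_1, \ldots, Y_n)$ of $n$ i.i.d. samples drawn from a $N(\theta, \sigma^2)$ distribution. We use $\mathbb{P}^n_\theta$ to denote this product distribution.

Let us apply the two-point Le Cam bound (15.14) with the distributions $\mathbb{P}^n_0$ and $\mathbb{P}^n_\theta$. We set $\theta = 2\delta$, for some $\delta$ to be chosen later in the proof, which ensures that the two means are $2\delta$-separated. In order to apply the two-point Le Cam bound, we need to bound the total variation distance $\|\mathbb{P}^n_\theta - \mathbb{P}^n_0\|_{\mathrm{TV}}$. From the second-moment bound in Exercise 15.10(b), we have

$$\|\mathbb{P}^n_\theta - \mathbb{P}^n_0\|^2_{\mathrm{TV}} \leq \frac{1}{4}\bigl\{ e^{n\theta^2/\sigma^2} - 1 \bigr\} = \frac{1}{4}\bigl\{ e^{4n\delta^2/\sigma^2} - 1 \bigr\}. \tag{15.15}$$

Setting $\delta = \frac{1}{2}\frac{\sigma}{\sqrt{n}}$ thus yields

$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[|\hat{\theta} - \theta|\bigr] \geq \frac{\delta}{2}\Bigl\{ 1 - \frac{1}{2}\sqrt{e - 1} \Bigr\} \geq \frac{\delta}{6} = \frac{1}{12}\frac{\sigma}{\sqrt{n}} \tag{15.16a}$$

and

$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[(\hat{\theta} - \theta)^2\bigr] \geq \frac{\delta^2}{2}\Bigl\{ 1 - \frac{1}{2}\sqrt{e - 1} \Bigr\} \geq \frac{\delta^2}{6} = \frac{1}{24}\frac{\sigma^2}{n}. \tag{15.16b}$$

Although the pre-factors 1/12 and 1/24 are not optimal, the scalings $\sigma/\sqrt{n}$ and $\sigma^2/n$ are sharp. For instance, the sample mean $\tilde{\theta}_n := \frac{1}{n}\sum_{i=1}^n Y_i$ satisfies the bounds

$$\sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[|\tilde{\theta}_n - \theta|\bigr] = \sqrt{\frac{2}{\pi}}\,\frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[(\tilde{\theta}_n - \theta)^2\bigr] = \frac{\sigma^2}{n}.$$

In Exercise 15.8, we explore an alternative approach, based on using the Pinsker-Csiszár-Kullback inequality from Lemma 15.2 to upper bound the TV distance in terms of the KL divergence. This approach yields a result with sharper constants. ♣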
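The constants in the two-point Gaussian bound can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `le_cam_two_point_gaussian` and its dictionary keys are illustrative choices, and the code simply evaluates the right-hand side of (15.15) at $\delta = \sigma/(2\sqrt{n})$ and compares the resulting lower bounds with the risks of the sample mean.

```python
import math


def le_cam_two_point_gaussian(sigma: float, n: int) -> dict:
    """Evaluate the two-point bounds (15.16a)-(15.16b) at delta = sigma / (2 sqrt(n)).

    All names here are hypothetical; only the formulas come from the text.
    """
    delta = sigma / (2.0 * math.sqrt(n))
    # Square root of the right-hand side of (15.15): an upper bound on the TV distance.
    tv_upper = 0.5 * math.sqrt(math.exp(4.0 * n * delta**2 / sigma**2) - 1.0)
    slack = 1.0 - tv_upper
    return {
        "delta": delta,
        "tv_upper": tv_upper,
        "abs_lower": 0.5 * delta * slack,       # lower bound on the absolute-error risk
        "sq_lower": 0.5 * delta**2 * slack,     # lower bound on the squared-error risk
        "abs_sample_mean": math.sqrt(2.0 / math.pi) * sigma / math.sqrt(n),
        "sq_sample_mean": sigma**2 / n,
    }


if __name__ == "__main__":
    for key, value in le_cam_two_point_gaussian(sigma=1.0, n=100).items():
        print(f"{key}: {value:.6f}")
```

The lower bounds and the sample-mean risks differ only by constant factors, which is the sense in which the scalings $\sigma/\sqrt{n}$ and $\sigma^2/n$ are sharp.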
Mean-squared error decaying as $n^{-1}$ is typical for parametric problems with a certain type of regularity, of which the Gaussian location model is the archetypal example. For other "non-regular" problems, faster rates become possible, and the minimax lower bounds take a different form. The following example provides one illustration of this phenomenon:


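Example 15.5 (Uniform location family) Let us consider the uniform location family, in which, for each $\theta \in \mathbb{R}$, the distribution $\mathbb{U}_\theta$ is uniform over the interval $[\theta, \theta + 1]$. We let $\mathbb{U}^n_\theta$ denote the product distribution of $n$ i.i.d. samples from $\mathbb{U}_\theta$. In this case, it is not possible to use Lemma 15.2 to control the total variation norm, since the Kullback-Leibler divergence between $\mathbb{U}_\theta$ and $\mathbb{U}_{\theta'}$ is infinite whenever $\theta \neq \theta'$. Accordingly, we need to use an alternative distance measure: in this example, we illustrate the use of the Hellinger distance (see equation (15.9)).

Given a pair $\theta, \theta' \in \mathbb{R}$, let us compute the Hellinger distance between $\mathbb{U}_\theta$ and $\mathbb{U}_{\theta'}$. By symmetry, it suffices to consider the case $\theta' > \theta$. If $\theta' > \theta + 1$, then we have $H^2(\mathbb{U}_\theta \,\|\, \mathbb{U}_{\theta'}) = 2$. Otherwise, when $\theta' \in (\theta, \theta + 1]$, we have

$$H^2(\mathbb{U}_\theta \,\|\, \mathbb{U}_{\theta'}) = \int_{\theta}^{\theta'} dt + \int_{\theta + 1}^{\theta' + 1} dt = 2|\theta' - \theta|.$$

Consequently, if we take a pair $\theta, \theta'$ such that $|\theta' - \theta| = 2\delta := \frac{1}{4n}$, then the relation (15.12b) guarantees that

$$\frac{1}{2} H^2(\mathbb{U}^n_\theta \,\|\, \mathbb{U}^n_{\theta'}) \leq \frac{n}{2}\, 2|\theta' - \theta| = \frac{1}{4}.$$

In conjunction with Lemma 15.3, we find that

$$\|\mathbb{U}^n_\theta - \mathbb{U}^n_{\theta'}\|^2_{\mathrm{TV}} \leq H^2(\mathbb{U}^n_\theta \,\|\, \mathbb{U}^n_{\theta'}) \leq \frac{1}{2}.$$

From the lower bound (15.14) with $\Phi(t) = t^2$ and $\delta = \frac{1}{8n}$, we conclude that, for the uniform location family, the minimax risk is lower bounded as

$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[(\hat{\theta} - \theta)^2\bigr] \geq \frac{\bigl(1 - \frac{1}{\sqrt{2}}\bigr)}{128}\, \frac{1}{n^2}.$$

The significant aspect of this lower bound is the faster $n^{-2}$ rate, which should be contrasted with the $n^{-1}$ rate in the regular situation. In fact, this $n^{-2}$ rate is optimal for the uniform location model, achieved for instance by the estimator $\hat{\theta} = \min\{Y_1, \ldots, Y_n\}$; see Exercise 15.9 for details. ♣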
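To make the $n^{-2}$ scaling concrete, here is a small sketch comparing the lower bound $(1 - 1/\sqrt{2})/(128 n^2)$ with the risk of the minimum estimator. It relies on one fact not stated in the text: for $Y_i$ uniform on $[\theta, \theta+1]$, the quantity $\min_i Y_i - \theta$ has a Beta$(1, n)$ distribution, whose second moment is $2/((n+1)(n+2))$. The function names and the simulation parameters are illustrative assumptions.

```python
import random


def le_cam_lower_bound(n: int) -> float:
    """Two-point lower bound for the uniform location family: (1 - 1/sqrt(2)) / (128 n^2)."""
    return (1.0 - 2.0 ** -0.5) / (128.0 * n * n)


def min_estimator_risk(n: int) -> float:
    """Exact squared-error risk of min(Y_1, ..., Y_n): second moment of a Beta(1, n) variable."""
    return 2.0 / ((n + 1) * (n + 2))


def simulate_min_estimator_risk(n: int, theta: float = 0.3,
                                trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same risk, using a fixed seed for reproducibility."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate = min(theta + rng.random() for _ in range(n))
        total += (estimate - theta) ** 2
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, le_cam_lower_bound(n), min_estimator_risk(n))
```

Both quantities decay as $n^{-2}$, so the lower bound and the estimator's risk match up to a constant factor.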


Le Cam's method is also useful for various nonparametric problems, for instance those in which our goal is to estimate some functional $\theta\colon \mathscr{F} \to \mathbb{R}$ defined on a class of densities $\mathscr{F}$. A standard example is the problem of estimating a density at a point, say $x = 0$, in which case $\theta(f) := f(0)$ is known as an evaluation functional.

An important quantity in the Le Cam approach to such problems is the Lipschitz constant of the functional $\theta$ with respect to the Hellinger norm, given by

$$\omega(\epsilon; \theta, \mathscr{F}) := \sup_{f, g \in \mathscr{F}} \bigl\{ |\theta(f) - \theta(g)| \;\big|\; H^2(f \,\|\, g) \leq \epsilon^2 \bigr\}. \tag{15.17}$$

Here we use $H^2(f \,\|\, g)$ to mean the squared Hellinger distance between the distributions associated with the densities $f$ and $g$. Note that the quantity $\omega(\epsilon)$ measures the size of the fluctuations of $\theta(f)$ when $f$ is perturbed in a Hellinger neighborhood of radius $\epsilon$. The following corollary reveals the importance of this Lipschitz constant (15.17):


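Corollary 15.6 (Le Cam for functionals) For any increasing function $\Phi$ on the non-negative real line and any functional $\theta\colon \mathscr{F} \to \mathbb{R}$, we have

$$\inf_{\hat{\theta}} \sup_{f \in \mathscr{F}} \mathbb{E}\Bigl[\Phi\bigl(|\hat{\theta} - \theta(f)|\bigr)\Bigr] \geq \frac{1}{4}\, \Phi\Bigl( \frac{1}{2}\, \omega\Bigl(\frac{1}{2\sqrt{n}}; \theta, \mathscr{F}\Bigr) \Bigr). \tag{15.18}$$

Proof We adopt the shorthand $\omega(t) \equiv \omega(t; \theta, \mathscr{F})$ throughout the proof. Setting $\epsilon^2 = \frac{1}{4n}$, choose a pair $f, g$ that achieve³ the supremum defining $\omega(1/(2\sqrt{n}))$. By a combination of Le Cam's inequality (Lemma 15.3) and the decoupling property (15.12b) for the Hellinger distance, we have

$$\|\mathbb{P}^n_f - \mathbb{P}^n_g\|^2_{\mathrm{TV}} \leq H^2(\mathbb{P}^n_f \,\|\, \mathbb{P}^n_g) \leq n H^2(\mathbb{P}_f \,\|\, \mathbb{P}_g) \leq \frac{1}{4}.$$

Consequently, Le Cam's bound (15.14) with $\delta = \frac{1}{2}\omega\bigl(\frac{1}{2\sqrt{n}}\bigr)$ implies that

$$\inf_{\hat{\theta}} \sup_{f \in \mathscr{F}} \mathbb{E}\Bigl[\Phi\bigl(|\hat{\theta} - \theta(f)|\bigr)\Bigr] \geq \frac{1}{4}\, \Phi\Bigl( \frac{1}{2}\, \omega\Bigl(\frac{1}{2\sqrt{n}}\Bigr) \Bigr),$$

as claimed.

The elegance of Corollary 15.6 lies in the fact that it reduces the calculation of lower bounds to a geometric object, namely the Lipschitz constant (15.17). Some concrete examples are helpful to illustrate the basic ideas.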


³ If the supremum is not achieved, then we can choose a pair that approximates it to any desired accuracy, and repeat the argument.

Example 15.7 (Pointwise estimation of Lipschitz densities) Let us consider the family of densities on $[-\frac{1}{2}, \frac{1}{2}]$ that are bounded uniformly away from zero, and are Lipschitz with constant one, that is, $|f(x) - f(y)| \leq |x - y|$ for all $x, y \in [-\frac{1}{2}, \frac{1}{2}]$. Suppose that our goal is to estimate the linear functional $f \mapsto \theta(f) := f(0)$. In order to apply Corollary 15.6, it suffices to lower bound $\omega\bigl(\frac{1}{2\sqrt{n}}; \theta, \mathscr{F}\bigr)$, and we can do so by choosing a pair $f_0, g \in \mathscr{F}$ with $H^2(f_0 \,\|\, g) \leq \frac{1}{4n}$, and then evaluating the difference $|\theta(f_0) - \theta(g)|$. Let $f_0 \equiv 1$ be the uniform density on $[-\frac{1}{2}, \frac{1}{2}]$. For a parameter $\delta \in (0, \frac{1}{6}]$ to be chosen, consider the function

$$\phi(x) = \begin{cases} \delta - |x| & \text{for } |x| \leq \delta, \\ |x - 2\delta| - \delta & \text{for } x \in [\delta, 3\delta], \\ 0 & \text{otherwise.} \end{cases} \tag{15.19}$$

See Figure 15.2 for an illustration. By construction, the function $\phi$ is 1-Lipschitz, uniformly bounded with $\|\phi\|_\infty = \delta \leq \frac{1}{6}$, and integrates to zero, that is, $\int_{-1/2}^{1/2} \phi(x)\,dx = 0$. Consequently, the perturbed function $g := f_0 + \phi$ is a density function belonging to our class, and by construction, we have the equality $|\theta(f_0) - \theta(g)| = \delta$.

Figure 15.2 (plot of the hat function $\phi$; vertical axis: function value) Illustration of the hat function $\phi$ from equation (15.19) for $\delta = 0.12$. It is 1-Lipschitz, uniformly bounded as $\|\phi\|_\infty \leq \delta$, and it integrates to zero.


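It remains to control the squared Hellinger distance. By definition, we have

$$\frac{1}{2} H^2(f_0 \,\|\, g) = 1 - \int_{-1/2}^{1/2} \sqrt{1 + \phi(t)}\, dt.$$

Define the function $\Psi(u) = \sqrt{1 + u}$, and note that $|\Psi''(u)| = \frac{1}{4}(1 + u)^{-3/2} \leq \frac{1}{3}$ for all $|u| \leq \frac{1}{6}$, a range that contains every value taken by $\phi$. Consequently, by a second-order Taylor-series expansion around $u = 0$, we have

$$\frac{1}{2} H^2(f_0 \,\|\, g) = \int_{-1/2}^{1/2} \bigl\{ \Psi(0) - \Psi(\phi(t)) \bigr\}\, dt \leq \int_{-1/2}^{1/2} \bigl\{ -\Psi'(0)\phi(t) + \tfrac{1}{6}\phi^2(t) \bigr\}\, dt. \tag{15.20}$$

Observe that

$$\int_{-1/2}^{1/2} \phi(t)\, dt = 0 \quad \text{and} \quad \int_{-1/2}^{1/2} \phi^2(t)\, dt = 4 \int_0^{\delta} (\delta - x)^2\, dx = \frac{4}{3}\delta^3.$$

Combined with our Taylor-series bound (15.20), we find that

$$H^2(f_0 \,\|\, g) \leq 2 \cdot \frac{1}{6} \cdot \frac{4}{3}\delta^3 = \frac{4}{9}\delta^3.$$

Consequently, setting $\delta^3 = \frac{9}{16n}$ ensures that $H^2(f_0 \,\|\, g) \leq \frac{1}{4n}$; the requirement $\delta \leq \frac{1}{6}$ is met as soon as $n \geq 122$. Putting together the pieces, Corollary 15.6 with $\Phi(t) = t$ implies that

$$\inf_{\hat{\theta}} \sup_{f \in \mathscr{F}} \mathbb{E}\bigl[|\hat{\theta} - f(0)|\bigr] \geq \frac{1}{8}\, \omega\Bigl(\frac{1}{2\sqrt{n}}; \theta, \mathscr{F}\Bigr) \geq \frac{\delta}{8} \gtrsim n^{-1/3}.$$

This $n^{-1/3}$ lower bound for the Lipschitz family can be achieved by various estimators, so that we have derived a sharp lower bound. ♣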
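The properties of the hat perturbation used in this example can be checked numerically. The sketch below is illustrative only: the helper names and the midpoint-rule integrator are assumptions, not part of the text. It integrates $\phi$, $\phi^2$ and $1 - \sqrt{1 + \phi}$ over $[-\frac{1}{2}, \frac{1}{2}]$ and reports the squared Hellinger distance next to $\frac{4}{9}\delta^3$.

```python
import math


def hat(x: float, delta: float) -> float:
    """Hat perturbation: a positive bump on [-delta, delta], a negative bump on [delta, 3 delta]."""
    if abs(x) <= delta:
        return delta - abs(x)
    if delta < x <= 3.0 * delta:
        return abs(x - 2.0 * delta) - delta
    return 0.0


def midpoint_integral(func, lower: float, upper: float, steps: int = 100_000) -> float:
    """Composite midpoint rule; adequate for these piecewise-smooth integrands."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def hat_summary(delta: float):
    """Return (integral of phi, integral of phi^2, squared Hellinger distance H^2(f0 || f0 + phi))."""
    integral = midpoint_integral(lambda x: hat(x, delta), -0.5, 0.5)
    second_moment = midpoint_integral(lambda x: hat(x, delta) ** 2, -0.5, 0.5)
    half_hellinger = midpoint_integral(
        lambda x: 1.0 - math.sqrt(1.0 + hat(x, delta)), -0.5, 0.5
    )
    return integral, second_moment, 2.0 * half_hellinger


if __name__ == "__main__":
    delta = 0.12
    integral, second_moment, hellinger_sq = hat_summary(delta)
    print(integral, second_moment, 4.0 * delta**3 / 3.0)
    print(hellinger_sq, 4.0 * delta**3 / 9.0)
```

For small $\delta$ the squared Hellinger distance is close to $\delta^3/3$, so the $\delta^3$ scaling, and hence the $n^{-1/3}$ rate, does not depend on the particular constant used in the Taylor bound.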


We now turn to the use of the two-class lower bound for a nonlinear functional in a nonparametric problem. Although the resulting bound is non-trivial, it is not a sharp result, unlike in the previous examples. Later, we will develop Le Cam's refinement of the two-class approach so as to obtain sharp rates.


Example 15.8 (Lower bounds for quadratic functionals) Given positive constants $c_0 < 1 < c_1$ and $c_2 > 1$, consider the class of twice-differentiable density functions

$$\mathscr{F}_2([0, 1]) := \Bigl\{ f\colon [0, 1] \to [c_0, c_1] \;\Big|\; \|f''\|_\infty \leq c_2 \text{ and } \int_0^1 f(x)\, dx = 1 \Bigr\} \tag{15.21}$$

that are uniformly bounded above and below, and have a uniformly bounded second derivative. Consider the quadratic functional $f \mapsto \theta(f) := \int_0^1 (f'(x))^2\, dx$. Note that $\theta(f)$ provides a measure of the "smoothness" of the density: it is zero for the uniform density, and becomes large for densities with more erratic behavior. Estimation of such quadratic functionals arises in a variety of applications; see the bibliographic section for further discussion.

We again use Corollary 15.6 to derive a lower bound. Let $f_0$ denote the uniform distribution on $[0, 1]$, which clearly belongs to $\mathscr{F}_2$. As in Example 15.7, we construct a perturbation $g$ of $f_0$ whose squared Hellinger distance $H^2(f_0 \,\|\, g)$ is of order $1/n$; Corollary 15.6 then gives a minimax lower bound governed by the size of the difference $\theta(f_0) - \theta(g)$.

In order to construct the perturbation, let $\phi\colon [0, 1] \to \mathbb{R}$ be a fixed twice-differentiable function that is uniformly bounded as $\|\phi\|_\infty \leq \frac{1}{2}$, and such that

$$\int_0^1 \phi(x)\, dx = 0 \quad \text{and} \quad b_\ell := \int_0^1 \bigl(\phi^{(\ell)}(x)\bigr)^2\, dx > 0 \quad \text{for } \ell = 0, 1. \tag{15.22}$$

Now divide the unit interval $[0, 1]$ into $m$ sub-intervals $[x_j, x_{j+1}]$, with $x_j = \frac{j}{m}$ for $j = 0, \ldots, m - 1$. For a suitably small constant $C > 0$, define the shifted and rescaled functions

$$\phi_j(x) := \begin{cases} \dfrac{C}{m^2}\, \phi\bigl(m(x - x_j)\bigr) & \text{if } x \in [x_j, x_{j+1}], \\ 0 & \text{otherwise.} \end{cases} \tag{15.23}$$

We then consider the density $g(x) := 1 + \sum_{j=1}^m \phi_j(x)$. It can be seen that $g \in \mathscr{F}_2$ as long as the constant $C$ is chosen sufficiently small. See Figure 15.3 for an illustration of this construction.

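Let us now control the Hellinger distance. Following the same Taylor-series argument as in Example 15.7, we have

$$\frac{1}{2} H^2(f_0 \,\|\, g) = 1 - \int_0^1 \sqrt{1 + \sum_{j=1}^m \phi_j(x)}\; dx \;\lesssim\; \int_0^1 \Bigl( \sum_{j=1}^m \phi_j(x) \Bigr)^2 dx = \sum_{j=1}^m \int_0^1 \phi_j^2(x)\, dx = \frac{C^2 b_0}{m^4},$$

so that $\frac{1}{2} H^2(f_0 \,\|\, g) \leq c\, \frac{b_0}{m^4}$, where $c > 0$ is a universal constant. Consequently, the choice $m^4 := 2 c\, b_0 n$ ensures that $H^2(f_0 \,\|\, g) \leq \frac{1}{n}$, which is the order required for applying Corollary 15.6.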
It remains to evaluate the difference between $\theta(f_0)$ and $\theta(g)$. On one hand, we have $\theta(f_0) = 0$, whereas on the other hand, we have

$$\theta(g) = \int_0^1 \Bigl( \sum_{j=1}^m \phi_j'(x) \Bigr)^2 dx = m \int_0^1 \bigl(\phi_j'(x)\bigr)^2\, dx = \frac{C^2 b_1}{m^2}.$$

Recalling the specified choice of $m$, we see that $|\theta(g) - \theta(f_0)| \geq \frac{K}{\sqrt{n}}$ for some universal constant $K$ independent of $n$. Consequently, Corollary 15.6 with $\Phi(t) = t$ implies that

$$\sup_{f \in \mathscr{F}_2} \mathbb{E}\bigl[|\hat{\theta} - \theta(f)|\bigr] \gtrsim n^{-1/2}. \tag{15.24}$$

Figure 15.3 Illustration of the construction of the density $g$. Upper left: an example of a base function $\phi$. Upper right: function $\phi_j$ is a rescaled and shifted version of $\phi$. Lower left: original uniform distribution. Lower right: final density $g$ is the superposition of the uniform density $f_0$ with the sum of the shifted functions $\{\phi_j\}_{j=1}^m$.

This lower bound, while valid, is not optimal: there is no estimator that can achieve error of the order of $n^{-1/2}$ uniformly over $\mathscr{F}_2$. Indeed, we will see that the minimax risk scales as $n^{-4/9}$, but proving this optimal lower bound requires an extension of the basic two-point technique, as we describe in the next section. ♣
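The identity $\theta(g) = C^2 b_1 / m^2$ can be illustrated with a concrete base function. The sketch below assumes the hypothetical choice $\phi(x) = \frac{1}{2}\sin(2\pi x)$, which is not taken from the text but satisfies the requirements in (15.22): it is bounded by $\frac{1}{2}$, integrates to zero, and has $b_0 = \frac{1}{8}$ and $b_1 = \frac{\pi^2}{2}$. The helper names are likewise illustrative.

```python
import math

B1 = math.pi**2 / 2.0  # b_1 = integral of (phi')^2 for the assumed phi(x) = 0.5 * sin(2 pi x)


def phi_prime(x: float) -> float:
    """Derivative of the assumed base function phi(x) = 0.5 * sin(2 pi x)."""
    return math.pi * math.cos(2.0 * math.pi * x)


def perturbation_derivative(x: float, m: int, C: float) -> float:
    """g'(x) = sum_j phi_j'(x); only the bump on the sub-interval containing x is active."""
    j = min(int(x * m), m - 1)
    return (C / m) * phi_prime(m * (x - j / m))


def quadratic_functional(m: int, C: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of theta(g) = integral over [0, 1] of (g'(x))^2."""
    width = 1.0 / steps
    return width * sum(
        perturbation_derivative((i + 0.5) * width, m, C) ** 2 for i in range(steps)
    )


if __name__ == "__main__":
    m, C = 5, 0.3
    print(quadratic_functional(m, C), C**2 * B1 / m**2)
```

Since $m^4$ is proportional to $n$, the value $\theta(g) = C^2 b_1/m^2$ is of order $n^{-1/2}$, which is the separation used in (15.24).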


15.2.2 Le Cam’s convex hull method 


Our discussion up until this point has focused on lower bounds obtained by single pairs of hypotheses. As we have seen, the difficulty of the testing problem is controlled by the total variation distance between the two distributions. Le Cam's method is an elegant generalization of this idea, one which allows us to take the convex hulls of two classes of distributions. In many cases, the separation in total variation norm as measured over the convex hulls is much smaller than the pointwise separation between two classes, and so leads to better lower bounds.

More concretely, consider two subsets $\mathcal{P}_0$ and $\mathcal{P}_1$ of $\mathcal{P}$ that are $2\delta$-separated, in the sense that

$$\rho(\theta(\mathbb{P}_0), \theta(\mathbb{P}_1)) \geq 2\delta \quad \text{for all } \mathbb{P}_0 \in \mathcal{P}_0 \text{ and } \mathbb{P}_1 \in \mathcal{P}_1. \tag{15.25}$$


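Lemma 15.9 (Le Cam) For any $2\delta$-separated classes of distributions $\mathcal{P}_0$ and $\mathcal{P}_1$ contained within $\mathcal{P}$, any estimator $\hat{\theta}$ has worst-case risk at least

$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\bigl[\rho(\hat{\theta}, \theta(\mathbb{P}))\bigr] \geq \frac{\delta}{2} \sup_{\substack{\mathbb{P}_0 \in \mathrm{conv}(\mathcal{P}_0) \\ \mathbb{P}_1 \in \mathrm{conv}(\mathcal{P}_1)}} \bigl\{ 1 - \|\mathbb{P}_0 - \mathbb{P}_1\|_{\mathrm{TV}} \bigr\}. \tag{15.26}$$

Proof For any estimator $\hat{\theta}$, let us define the random variables

$$V_j(\hat{\theta}) = \frac{1}{2\delta} \inf_{\mathbb{P}_j \in \mathcal{P}_j} \rho(\hat{\theta}, \theta(\mathbb{P}_j)), \quad \text{for } j = 0, 1.$$

We then have

$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\bigl[\rho(\hat{\theta}, \theta(\mathbb{P}))\bigr] \geq \frac{1}{2}\Bigl\{ \mathbb{E}_{\mathbb{P}_0}\bigl[\rho(\hat{\theta}, \theta(\mathbb{P}_0))\bigr] + \mathbb{E}_{\mathbb{P}_1}\bigl[\rho(\hat{\theta}, \theta(\mathbb{P}_1))\bigr] \Bigr\} \geq \delta \bigl\{ \mathbb{E}_{\mathbb{P}_0}[V_0(\hat{\theta})] + \mathbb{E}_{\mathbb{P}_1}[V_1(\hat{\theta})] \bigr\}.$$

Since the right-hand side is linear in $\mathbb{P}_0$ and $\mathbb{P}_1$, we can take suprema over the convex hulls, and thus obtain the lower bound

$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\bigl[\rho(\hat{\theta}, \theta(\mathbb{P}))\bigr] \geq \delta \sup_{\substack{\mathbb{P}_0 \in \mathrm{conv}(\mathcal{P}_0) \\ \mathbb{P}_1 \in \mathrm{conv}(\mathcal{P}_1)}} \bigl\{ \mathbb{E}_{\mathbb{P}_0}[V_0(\hat{\theta})] + \mathbb{E}_{\mathbb{P}_1}[V_1(\hat{\theta})] \bigr\}.$$

By the triangle inequality, we have

$$\rho(\hat{\theta}, \theta(\mathbb{P}_0)) + \rho(\hat{\theta}, \theta(\mathbb{P}_1)) \geq \rho(\theta(\mathbb{P}_0), \theta(\mathbb{P}_1)) \geq 2\delta.$$

Taking infima over $\mathbb{P}_j \in \mathcal{P}_j$ for each $j = 0, 1$, we obtain

$$\inf_{\mathbb{P}_0 \in \mathcal{P}_0} \rho(\hat{\theta}, \theta(\mathbb{P}_0)) + \inf_{\mathbb{P}_1 \in \mathcal{P}_1} \rho(\hat{\theta}, \theta(\mathbb{P}_1)) \geq 2\delta,$$

which is equivalent to $V_0(\hat{\theta}) + V_1(\hat{\theta}) \geq 1$. Since $V_j(\hat{\theta}) \geq 0$ for $j = 0, 1$, the variational representation of the TV distance (see Exercise 15.1) implies that, for any $\mathbb{P}_j \in \mathrm{conv}(\mathcal{P}_j)$, we have

$$\mathbb{E}_{\mathbb{P}_0}[V_0(\hat{\theta})] + \mathbb{E}_{\mathbb{P}_1}[V_1(\hat{\theta})] \geq 1 - \|\mathbb{P}_1 - \mathbb{P}_0\|_{\mathrm{TV}},$$

which completes the proof.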


In order to see how taking the convex hulls can decrease the total variation norm, it is instructive to return to the Gaussian location model previously introduced in Example 15.4:

Example 15.10 (Sharpened bounds for Gaussian location family) In Example 15.4, we used a two-point form of Le Cam's method to prove a lower bound on mean estimation in the Gaussian location family. A key step was to upper bound the TV distance $\|\mathbb{P}^n_\theta - \mathbb{P}^n_0\|_{\mathrm{TV}}$ between the $n$-fold product distributions based on the Gaussian models $N(\theta, \sigma^2)$ and $N(0, \sigma^2)$, respectively. Here let us show how the convex hull version of Le Cam's method can be used to sharpen this step, so as to obtain a bound with tighter constants. In particular, setting $\theta = 2\delta$ as before, consider the two families $\mathcal{P}_0 = \{\mathbb{P}^n_0\}$ and $\mathcal{P}_1 = \{\mathbb{P}^n_\theta, \mathbb{P}^n_{-\theta}\}$. Note that the mixture distribution $\bar{\mathbb{P}} := \frac{1}{2}\mathbb{P}^n_\theta + \frac{1}{2}\mathbb{P}^n_{-\theta}$ belongs to $\mathrm{conv}(\mathcal{P}_1)$. From the second-moment bound explored in Exercise 15.10(c), we have

$$\|\bar{\mathbb{P}} - \mathbb{P}^n_0\|^2_{\mathrm{TV}} \leq \frac{1}{4}\Bigl\{ e^{\frac{1}{2}\left(\frac{\sqrt{n}\,\theta}{\sigma}\right)^4} - 1 \Bigr\} = \frac{1}{4}\Bigl\{ e^{\frac{1}{2}\left(\frac{2\sqrt{n}\,\delta}{\sigma}\right)^4} - 1 \Bigr\}. \tag{15.27}$$

Setting $\delta = \frac{\sigma t}{2\sqrt{n}}$ for some parameter $t > 0$ to be chosen, the convex hull Le Cam bound (15.26) yields

$$\min_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[|\hat{\theta} - \theta|\bigr] \geq \frac{\sigma}{4\sqrt{n}} \sup_{t > 0} \Bigl\{ t \Bigl( 1 - \frac{1}{2}\sqrt{e^{\frac{1}{2}t^4} - 1} \Bigr) \Bigr\} \geq \frac{3}{20}\frac{\sigma}{\sqrt{n}}.$$

This bound is an improvement over our original bound (15.16a) from Example 15.4, which has the pre-factor of $\frac{1}{12} \approx 0.08$, as opposed to $\frac{3}{20} = 0.15$ obtained from this analysis. Thus, even though we used the same base separation $\delta$, our use of mixture distributions reduced the TV distance (compare the bounds (15.27) and (15.15)), thereby leading to a sharper result. ♣
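The pre-factor obtained from the convex hull bound can be reproduced by a one-dimensional search over $t$. The sketch below is an illustration under stated assumptions: the grid of $t$ values and the function names are arbitrary choices, and the objective is $t\bigl(1 - \frac{1}{2}\sqrt{e^{t^4/2} - 1}\bigr)$, which multiplies $\sigma/(4\sqrt{n})$ in the bound.

```python
import math


def convex_hull_objective(t: float) -> float:
    """Function of t that multiplies sigma / (4 sqrt(n)) in the convex hull Le Cam bound."""
    return t * (1.0 - 0.5 * math.sqrt(math.exp(0.5 * t**4) - 1.0))


def best_prefactor(grid_size: int = 1500, step: float = 1e-3):
    """Grid search over t in (0, grid_size * step); returns (best t, resulting pre-factor)."""
    best_t = max((i * step for i in range(1, grid_size)), key=convex_hull_objective)
    return best_t, convex_hull_objective(best_t) / 4.0


if __name__ == "__main__":
    t_star, prefactor = best_prefactor()
    print(f"t* = {t_star:.3f}, pre-factor = {prefactor:.4f}, two-point pre-factor = {1 / 12:.4f}")
```

The maximizer lies slightly below $t = 0.9$, and the resulting pre-factor is just above $0.15$, compared with $1/12$ for the two-point argument.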


In the previous example, the gains from extending to the convex hull are only in terms of the constant pre-factors. Let us now turn to an example in which the gain is more substantial. Recall Example 15.8, in which we investigated the problem of estimating the quadratic functional $f \mapsto \theta(f) = \int_0^1 (f'(x))^2\, dx$ over the class $\mathscr{F}_2$ from equation (15.21). Let us now demonstrate how the use of Le Cam's method in its full convex hull form allows for the derivation of an optimal lower bound for the minimax risk.


Example 15.11 (Optimal bounds for quadratic functionals) For each binary vector $\alpha \in \{-1, +1\}^m$, define the distribution $\mathbb{P}_\alpha$ with density given by

$$f_\alpha(x) = 1 + \sum_{j=1}^m \alpha_j \phi_j(x).$$

Note that the perturbed density $g$ constructed in Example 15.8 is a special member of this family, generated by the binary vector $\alpha = (1, 1, \ldots, 1)$. Let $\mathbb{P}^n_\alpha$ denote the product distribution on $\mathcal{X}^n$ formed by sampling $n$ times independently from $\mathbb{P}_\alpha$, and define the two classes $\mathcal{P}_0 := \{\mathbb{U}^n\}$ and $\mathcal{P}_1 := \{\mathbb{P}^n_\alpha, \alpha \in \{-1, +1\}^m\}$. With these choices, we then have

$$\inf_{\substack{\mathbb{P}_j \in \mathrm{conv}(\mathcal{P}_j) \\ j = 0, 1}} \|\mathbb{P}_0 - \mathbb{P}_1\|_{\mathrm{TV}} \leq \|\mathbb{U}^n - \mathbb{Q}\|_{\mathrm{TV}} \leq H(\mathbb{U}^n \,\|\, \mathbb{Q}),$$

where $\mathbb{Q} := 2^{-m} \sum_{\alpha \in \{-1, +1\}^m} \mathbb{P}^n_\alpha$ is the uniformly weighted mixture over all $2^m$ choices of $\mathbb{P}^n_\alpha$.

Figure 15.4 Illustration of some densities of the form $f_\alpha(x) = 1 + \sum_{j=1}^m \alpha_j \phi_j(x)$ for different choices of sign vectors $\alpha \in \{-1, 1\}^m$. Note that there are $2^m$ such densities in total.

In this case, since $\mathbb{Q}$ is not a product distribution, we can no longer apply the decomposition (15.12a) so as to bound the Hellinger distance $H(\mathbb{U}^n \,\|\, \mathbb{Q})$ by a univariate version. Instead, some more technical calculations are required. One possible upper bound is given by

$$H^2(\mathbb{U}^n \,\|\, \mathbb{Q}) \leq n^2 \sum_{j=1}^m \Bigl( \int_0^1 \phi_j^2(x)\, dx \Bigr)^2. \tag{15.28}$$

See the bibliographic section for discussion of this upper bound as well as related results. If we take the upper bound (15.28) as given, then using the calculations from Example 15.8 (in particular, recall the definition of the constants $b_\ell$ from equation (15.22)), and taking $C \leq 1$, we find that

$$H^2(\mathbb{U}^n \,\|\, \mathbb{Q}) \leq m n^2 \frac{b_0^2}{m^{10}} = \frac{b_0^2 n^2}{m^9}.$$

Setting $m^9 = 4 b_0^2 n^2$ yields that $\|\mathbb{U}^n - \mathbb{Q}\|_{\mathrm{TV}} \leq H(\mathbb{U}^n \,\|\, \mathbb{Q}) \leq 1/2$, and hence Lemma 15.9, applied with the separation $2\delta = C^2 b_1 / m^2$, implies that

$$\sup_{f \in \mathscr{F}_2} \mathbb{E}\bigl[|\hat{\theta} - \theta(f)|\bigr] \geq \frac{\delta}{4} = \frac{C^2 b_1}{8 m^2} \gtrsim n^{-4/9}.$$

Thus, by using the full convex form of Le Cam's method, we have recovered a better lower bound on the minimax risk ($n^{-4/9} \gg n^{-1/2}$). This lower bound turns out to be unimprovable; see the bibliographic section for further discussion. ♣


15.3 Fano’s method 


In this section, we describe an alternative method for deriving lower bounds, one based on a classical result from information theory known as Fano's inequality.

15.3.1 Kullback-Leibler divergence and mutual information

Recall our basic set-up: we are interested in lower bounding the probability of error in an $M$-ary hypothesis testing problem, based on a family of distributions $\{\mathbb{P}_{\theta^1}, \ldots, \mathbb{P}_{\theta^M}\}$. A sample $Z$ is generated by choosing an index $J$ uniformly at random from the index set $[M] := \{1, \ldots, M\}$, and then generating data according to $\mathbb{P}_{\theta^J}$. In this way, the observation follows the mixture distribution $\mathbb{Q}_Z = \bar{\mathbb{Q}} := \frac{1}{M}\sum_{j=1}^M \mathbb{P}_{\theta^j}$. Our goal is to identify the index $J$ of the probability distribution from which a given sample has been drawn.

Intuitively, the difficulty of this problem depends on the amount of dependence between the observation $Z$ and the unknown random index $J$. In the extreme case, if $Z$ were actually independent of $J$, then observing $Z$ would have no value whatsoever. How should we measure the amount of dependence between a pair of random variables? Note that the pair $(Z, J)$ are independent if and only if their joint distribution $\mathbb{Q}_{Z,J}$ is equal to the product of its marginals, namely $\mathbb{Q}_Z \mathbb{Q}_J$. Thus, a natural way in which to measure dependence is by computing some type of divergence measure between the joint distribution and the product of marginals. The mutual information between the random variables $(Z, J)$ is defined in exactly this way, using the Kullback-Leibler divergence as the underlying measure of distance, that is,

$$I(Z; J) := D(\mathbb{Q}_{Z,J} \,\|\, \mathbb{Q}_Z \mathbb{Q}_J). \tag{15.29}$$

By standard properties of the KL divergence, we always have $I(Z; J) \geq 0$, and moreover $I(Z; J) = 0$ if and only if $Z$ and $J$ are independent.

Given our set-up and the definition of the KL divergence, the mutual information can be written in terms of component distributions $\{\mathbb{P}_{\theta^j}, j \in [M]\}$ and the mixture distribution $\bar{\mathbb{Q}} \equiv \mathbb{Q}_Z$, in particular as

$$I(Z; J) = \frac{1}{M}\sum_{j=1}^M D(\mathbb{P}_{\theta^j} \,\|\, \bar{\mathbb{Q}}), \tag{15.30}$$

corresponding to the mean KL divergence between $\mathbb{P}_{\theta^j}$ and $\bar{\mathbb{Q}}$, averaged over the choice of index $j$. Consequently, the mutual information is small if the distributions $\mathbb{P}_{\theta^j}$ are hard to distinguish from the mixture distribution $\bar{\mathbb{Q}}$ on average.
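For distributions on a finite set, the equivalence between the joint-versus-product definition (15.29) and the mixture representation (15.30) can be verified directly. The following sketch is illustrative: the function names are hypothetical, and the component distributions are represented as lists of probabilities over a common finite alphabet.

```python
import math


def kl_divergence(p, q) -> float:
    """KL divergence D(p || q) for probability vectors on a common finite alphabet."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_information_two_ways(components):
    """Return I(Z; J) computed via the mixture form (15.30) and via the joint form (15.29)."""
    M = len(components)
    mixture = [sum(column) / M for column in zip(*components)]
    # (15.30): average KL divergence between each component and the mixture.
    via_mixture = sum(kl_divergence(p, mixture) for p in components) / M
    # (15.29): KL divergence between the joint law of (J, Z) and the product of marginals.
    joint = [p_z / M for p in components for p_z in p]
    product = [q_z / M for _ in components for q_z in mixture]
    via_joint = kl_divergence(joint, product)
    return via_mixture, via_joint


if __name__ == "__main__":
    components = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
    print(mutual_information_two_ways(components))
```

When all components coincide, both expressions vanish, matching the statement that $I(Z; J) = 0$ under independence.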


15.3.2 Fano lower bound on minimax risk 


Let us now return to the problem at hand: namely, obtaining lower bounds on the minimax error. The Fano method is based on the following lower bound on the error probability in an $M$-ary testing problem, applicable when $J$ is uniformly distributed over the index set:

$$\mathbb{P}[\psi(Z) \neq J] \geq 1 - \frac{I(Z; J) + \log 2}{\log M}. \tag{15.31}$$

When combined with the reduction from estimation to testing given in Proposition 15.1, we obtain the following lower bound on the minimax error:

Proposition 15.12 Let $\{\theta^1, \ldots, \theta^M\}$ be a $2\delta$-separated set in the $\rho$ semi-metric on $\Theta(\mathcal{P})$, and suppose that $J$ is uniformly distributed over the index set $\{1, \ldots, M\}$, and $(Z \mid J = j) \sim \mathbb{P}_{\theta^j}$. Then for any increasing function $\Phi\colon [0, \infty) \to [0, \infty)$, the minimax risk is lower bounded as

$$\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \geq \Phi(\delta) \Bigl\{ 1 - \frac{I(Z; J) + \log 2}{\log M} \Bigr\}, \tag{15.32}$$

where $I(Z; J)$ is the mutual information between $Z$ and $J$.


We provide a proof of the Fano bound (15.31), from which Proposition 15.12 follows, in the sequel (see Section 15.4). For the moment, in order to gain intuition for this result, it is helpful to consider the behavior of the different terms as $\delta \to 0^+$. As we shrink $\delta$, the $2\delta$-separation criterion becomes milder, so that the cardinality $M \equiv M(2\delta)$ in the denominator increases. At the same time, in a generic setting, the mutual information $I(Z; J)$ will decrease, since the random index $J \in [M(2\delta)]$ can take on a larger number of potential values. By decreasing $\delta$ sufficiently, we may thereby ensure that

$$\frac{I(Z; J) + \log 2}{\log M} \leq \frac{1}{2}, \tag{15.33}$$

so that the lower bound (15.32) implies that $\mathfrak{M}(\theta(\mathcal{P}); \Phi \circ \rho) \geq \frac{1}{2}\Phi(\delta)$. Thus, we have a generic scheme for deriving lower bounds on the minimax risk.

In order to derive lower bounds in this way, there remain two technical and possibly challenging steps. The first requirement is to specify $2\delta$-separated sets with large cardinality $M(2\delta)$. Here the theory of metric entropy developed in Chapter 5 plays an important role, since any $2\delta$-packing set is (by definition) $2\delta$-separated in the $\rho$ semi-metric. The second requirement is to compute, or more realistically to upper bound, the mutual information $I(Z; J)$. In general, this second step is non-trivial, but various avenues are possible.

The simplest upper bound on the mutual information is based on the convexity of the Kullback-Leibler divergence (see Exercise 15.3). Using this convexity and the mixture representation (15.30), we find that

$$I(Z; J) \leq \frac{1}{M^2} \sum_{j,k=1}^M D(\mathbb{P}_{\theta^j} \,\|\, \mathbb{P}_{\theta^k}). \tag{15.34}$$

Consequently, if we can construct a $2\delta$-separated set such that all pairs of distributions $\mathbb{P}_{\theta^j}$ and $\mathbb{P}_{\theta^k}$ are close on average, the mutual information can be controlled. Let us illustrate the use of this upper bound for a simple parametric problem.
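As a quick sanity check of the convexity-based bound (15.34), the sketch below compares the mutual information of a finite mixture with the average pairwise KL divergence. The names and the example distributions are illustrative assumptions; all components are taken to have full support so that every pairwise divergence is finite.

```python
import math


def kl_divergence(p, q) -> float:
    """KL divergence D(p || q) for probability vectors with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_information(components) -> float:
    """I(Z; J) for uniform J, via the mixture representation (15.30)."""
    M = len(components)
    mixture = [sum(column) / M for column in zip(*components)]
    return sum(kl_divergence(p, mixture) for p in components) / M


def average_pairwise_kl(components) -> float:
    """Right-hand side of (15.34): the average of D(P_j || P_k) over all ordered pairs."""
    M = len(components)
    return sum(kl_divergence(p, q) for p in components for q in components) / M**2


if __name__ == "__main__":
    components = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
    print(mutual_information(components), average_pairwise_kl(components))
```

The gap between the two numbers reflects how much is lost by replacing the mixture in (15.30) with the individual components.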


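Example 15.13 (Normal location model via Fano method) Recall from Example 15.4 the normal location family, and the problem of estimating $\theta \in \mathbb{R}$ under the squared error. There we showed how to lower bound the minimax error using Le Cam's method; here let us derive a similar lower bound using Fano's method.

Consider the $2\delta$-separated set of real-valued parameters $\{\theta^1, \theta^2, \theta^3\} = \{0, 2\delta, -2\delta\}$. Since $\mathbb{P}_{\theta^j} = N(\theta^j, \sigma^2)$ and the largest gap between two of these parameters is $4\delta$, we have

$$D(\mathbb{P}^{1:n}_{\theta^j} \,\|\, \mathbb{P}^{1:n}_{\theta^k}) = \frac{n}{2\sigma^2}(\theta^j - \theta^k)^2 \leq \frac{8n\delta^2}{\sigma^2} \quad \text{for all } j, k = 1, 2, 3.$$

The bound (15.34) then ensures that $I(Z; J_\delta) \leq \frac{8n\delta^2}{\sigma^2}$, and choosing $\delta^2 = \frac{\sigma^2}{80n}$ ensures that $\frac{8n\delta^2/\sigma^2 + \log 2}{\log 3} < 0.75$. Putting together the pieces, the Fano bound (15.32) with $\Phi(t) = t^2$ implies that

$$\sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\bigl[(\hat{\theta} - \theta)^2\bigr] \geq \frac{\delta^2}{4} = \frac{1}{320}\frac{\sigma^2}{n}.$$

In this way, we have re-derived a minimax lower bound of the order $\sigma^2/n$, which, as discussed in Example 15.4, is of the correct order. ♣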
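The arithmetic behind a three-point Fano argument of this kind is easy to check numerically. The sketch below is a hedged illustration: it assumes the parameter set $\{0, 2\delta, -2\delta\}$ with the specific choice $\delta^2 = \sigma^2/(80n)$, computes all pairwise KL divergences $\frac{n}{2\sigma^2}(\theta^j - \theta^k)^2$ exactly, and evaluates the ratio $(I + \log 2)/\log 3$ using the largest of them as an upper bound on the mutual information. The function name and return layout are hypothetical.

```python
import math


def fano_three_point(sigma: float, n: int) -> dict:
    """Evaluate a three-point Fano bound for the Gaussian location family (assumed delta choice)."""
    delta = sigma / math.sqrt(80.0 * n)
    thetas = [0.0, 2.0 * delta, -2.0 * delta]
    # KL divergence between n-fold products of N(a, sigma^2) and N(b, sigma^2).
    kls = [n * (a - b) ** 2 / (2.0 * sigma**2) for a in thetas for b in thetas]
    max_kl = max(kls)
    avg_kl = sum(kls) / len(kls)          # right-hand side of the convexity bound (15.34)
    ratio = (max_kl + math.log(2.0)) / math.log(3.0)
    return {
        "delta_sq": delta**2,
        "max_kl": max_kl,
        "avg_kl": avg_kl,
        "ratio": ratio,
        "risk_lower_bound": delta**2 * (1.0 - ratio),
    }


if __name__ == "__main__":
    print(fano_three_point(sigma=1.0, n=100))
```

With this choice the ratio is roughly $0.72$, so the bracket in the Fano bound is at least $1/4$ and the resulting lower bound is of order $\sigma^2/n$.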


15.3.3 Bounds based on local packings 


Let us now formalize the approach that was used in the previous example. It is based on a local packing of the parameter space $\Omega$, which underlies what is called the "generalized Fano" method in the statistics literature. (As a sidenote, this nomenclature is very misleading, because the method is actually based on a substantial weakening of the Fano bound, obtained from the inequality (15.34).)

The local packing approach proceeds as follows. Suppose that we can construct a $2\delta$-separated set contained within $\Omega$ such that, for some quantity $c$, the Kullback-Leibler divergences satisfy the uniform upper bound

$$\sqrt{D(\mathbb{P}_{\theta^j} \,\|\, \mathbb{P}_{\theta^k})} \leq c\sqrt{n}\,\delta \quad \text{for all } j \neq k. \tag{15.35a}$$

The bound (15.34) then implies that $I(Z; J) \leq c^2 n \delta^2$, and hence the bound (15.33) will hold as long as

$$\log M(2\delta) \geq 2\bigl\{ c^2 n \delta^2 + \log 2 \bigr\}. \tag{15.35b}$$

In summary, if we can find a $2\delta$-separated family of distributions such that conditions (15.35a) and (15.35b) both hold, then we may conclude that the minimax risk is lower bounded as $\mathfrak{M}(\theta(\Omega); \Phi \circ \rho) \geq \frac{1}{2}\Phi(\delta)$.

Let us illustrate the local packing approach with some examples.


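Example 15.14 (Minimax risks for linear regression) Consider the standard linear regression model $y = X\theta^* + w$, where $X \in \mathbb{R}^{n \times d}$ is a fixed design matrix, and the vector $w \sim N(0, \sigma^2 I_n)$ is observation noise. Viewing the design matrix $X$ as fixed, let us obtain lower bounds on the minimax risk in the prediction (semi-)norm $\rho_X(\hat{\theta}, \theta^*) := \frac{\|X(\hat{\theta} - \theta^*)\|_2}{\sqrt{n}}$, assuming that $\theta^*$ is allowed to vary over $\mathbb{R}^d$.

For a tolerance $\delta > 0$ to be chosen, consider the set

$$\bigl\{ \gamma \in \mathrm{range}(X) \;\big|\; \|\gamma\|_2 \leq 4\delta\sqrt{n} \bigr\},$$

and let $\{\gamma^1, \ldots, \gamma^M\}$ be a $2\delta\sqrt{n}$-packing in the $\ell_2$-norm. Since this set sits in a space of dimension $r = \mathrm{rank}(X)$, Lemma 5.7 implies that we can find such a packing with $\log M \geq r \log 2$ elements. We thus have a collection of vectors of the form $\gamma^j = X\theta^j$ for some $\theta^j \in \mathbb{R}^d$, and such that

$$\frac{\|X\theta^j\|_2}{\sqrt{n}} \leq 4\delta \quad \text{for each } j \in [M], \tag{15.36a}$$

$$2\delta \leq \frac{\|X(\theta^j - \theta^k)\|_2}{\sqrt{n}} \leq 8\delta \quad \text{for each } j \neq k \in [M] \times [M]. \tag{15.36b}$$

Let $\mathbb{P}_{\theta^j}$ denote the distribution of $y$ when the true regression vector is $\theta^j$; by the definition of the model, under $\mathbb{P}_{\theta^j}$, the observed vector $y \in \mathbb{R}^n$ follows a $N(X\theta^j, \sigma^2 I_n)$ distribution. Consequently, the result of Exercise 15.13 ensures that

$$D(\mathbb{P}_{\theta^j} \,\|\, \mathbb{P}_{\theta^k}) = \frac{1}{2\sigma^2}\|X(\theta^j - \theta^k)\|_2^2 \leq \frac{32 n \delta^2}{\sigma^2}, \tag{15.37}$$

where the inequality follows from the upper bound (15.36b). The condition (15.35b) therefore reads $r \log 2 \geq 2\bigl\{ \frac{32 n \delta^2}{\sigma^2} + \log 2 \bigr\}$. Consequently, for $r$ sufficiently large ($r \geq 8$ suffices), it can be satisfied by setting $\delta^2 = \frac{\sigma^2}{128}\frac{r}{n}$, and we conclude that

$$\inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}^d} \mathbb{E}\Bigl[ \frac{1}{n}\|X(\hat{\theta} - \theta)\|_2^2 \Bigr] \geq \frac{\delta^2}{2} = \frac{\sigma^2}{256}\frac{\mathrm{rank}(X)}{n}.$$

This lower bound is sharp up to constant pre-factors: as shown by our analysis in Example 13.8 and Exercise 13.2, it can be achieved by the usual linear least-squares estimate. ♣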
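The phrase "for $r$ sufficiently large" can be made concrete by checking the local-packing condition (15.35b) directly. The sketch below is an illustration under explicit assumptions: it takes $\log M = r \log 2$, the KL bound $32 n \delta^2/\sigma^2$ from (15.37), and the choice $\delta^2 = \sigma^2 r/(128 n)$; the function names are hypothetical.

```python
import math


def packing_condition_holds(r: int, n: int = 1, sigma: float = 1.0) -> bool:
    """Check log M >= 2 * (KL bound + log 2) with log M = r log 2 and delta^2 = sigma^2 r / (128 n)."""
    delta_sq = sigma**2 * r / (128.0 * n)
    kl_bound = 32.0 * n * delta_sq / sigma**2   # equals r / 4 for this choice of delta
    return r * math.log(2.0) >= 2.0 * (kl_bound + math.log(2.0))


def smallest_valid_rank(max_rank: int = 1000):
    """Smallest rank r for which the condition holds, or None if none is found."""
    for r in range(1, max_rank + 1):
        if packing_condition_holds(r):
            return r
    return None


if __name__ == "__main__":
    print(smallest_valid_rank())
```

Because the condition reduces to $r(\log 2 - \frac{1}{2}) \geq 2\log 2$, it does not depend on $n$ or $\sigma$, and once it holds for some rank it holds for all larger ranks.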


Let us now see how the upper bound (15.34) and Fano's method can be applied to a nonparametric problem.

Example 15.15 (Minimax risk for density estimation) Recall from equation (15.21) the family $\mathscr{F}_2$ of twice-smooth densities on $[0, 1]$, bounded uniformly above, bounded uniformly away from zero, and with uniformly bounded second derivative. Let us consider the problem of estimating the entire density function $f$, using the Hellinger distance as our underlying metric $\rho$.

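In order to construct a local packing, we make use of the family of perturbed densities from Example 15.11, each of the form $f_\alpha(x) = 1 + \sum_{j=1}^m \alpha_j \phi_j(x)$, where $\alpha \in \{-1, +1\}^m$ and the function $\phi_j$ was defined in equation (15.23). Although there are $2^m$ such perturbed densities, it is convenient to use only a well-separated subset of them. Let $M_H(\frac{1}{4}; \mathbb{H}^m)$ denote the $\frac{1}{4}$-packing number of the binary hypercube $\{-1, +1\}^m$ in the rescaled Hamming metric. From our calculations in Example 5.3, we know that

$$\log M_H(\tfrac{1}{4}; \mathbb{H}^m) \geq m\, D(\tfrac{1}{4} \,\|\, \tfrac{1}{2}) \geq \frac{m}{10}.$$

(See in particular equation (5.3).) Consequently, we can find a subset $\mathbb{T} \subset \{-1, +1\}^m$ with cardinality at least $e^{m/10}$ such that

$$d_H(\alpha, \beta) = \frac{1}{m}\sum_{j=1}^m \mathbb{I}[\alpha_j \neq \beta_j] \geq 1/4 \quad \text{for all } \alpha \neq \beta \in \mathbb{T}. \tag{15.38}$$

We then consider the family of $M = e^{m/10}$ distributions $\{\mathbb{P}_\alpha, \alpha \in \mathbb{T}\}$, where $\mathbb{P}_\alpha$ has density $f_\alpha$.

We first lower bound the Hellinger distance between distinct pairs $f_\alpha$ and $f_\beta$. Since $\phi_j$ is non-zero only on the interval $I_j = [x_j, x_{j+1}]$, we can write

$$\int_0^1 \Bigl( \sqrt{f_\alpha(x)} - \sqrt{f_\beta(x)} \Bigr)^2 dx = \sum_{j=0}^{m-1} \int_{I_j} \Bigl( \sqrt{f_\alpha(x)} - \sqrt{f_\beta(x)} \Bigr)^2 dx.$$

But on the interval $I_j$, we have

$$\Bigl( \sqrt{f_\alpha(x)} + \sqrt{f_\beta(x)} \Bigr)^2 \leq 2\bigl( f_\alpha(x) + f_\beta(x) \bigr) \leq 4,$$

and therefore

$$\int_{I_j} \Bigl( \sqrt{f_\alpha(x)} - \sqrt{f_\beta(x)} \Bigr)^2 dx \geq \frac{1}{4}\int_{I_j} \bigl( f_\alpha(x) - f_\beta(x) \bigr)^2 dx = \int_{I_j} \phi_j^2(x)\, dx \quad \text{whenever } \alpha_j \neq \beta_j.$$

Since $\int_{I_j} \phi_j^2(x)\, dx = \frac{C^2}{m^5}\int_0^1 \phi^2(x)\, dx = \frac{C^2 b_0}{m^5}$ and any distinct $\alpha \neq \beta$ differ in at least $m/4$ positions, we find that $H^2(\mathbb{P}_\alpha \,\|\, \mathbb{P}_\beta) \geq \frac{m}{4}\frac{C^2 b_0}{m^5} = \frac{C^2 b_0}{4 m^4} =: 4\delta^2$. Consequently, we have constructed a $2\delta$-separated set with $\delta^2 = \frac{C^2 b_0}{16 m^4}$.

Next we upper bound the pairwise KL divergence. By construction, we have $f_\beta(x) \geq 1/2$ for all $x \in [0, 1]$, and thus

$$D(\mathbb{P}_\alpha \,\|\, \mathbb{P}_\beta) \leq \int_0^1 \frac{\bigl( f_\alpha(x) - f_\beta(x) \bigr)^2}{f_\beta(x)}\, dx \leq 2\int_0^1 \bigl( f_\alpha(x) - f_\beta(x) \bigr)^2 dx \leq \frac{8 C^2 b_0}{m^4}, \tag{15.39}$$

where the final inequality follows by a similar sequence of calculations, since each of the at most $m$ intervals on which $\alpha_j \neq \beta_j$ contributes $4\int_{I_j} \phi_j^2(x)\, dx$. Overall, we have established the upper bound $D(\mathbb{P}^n_\alpha \,\|\, \mathbb{P}^n_\beta) = n D(\mathbb{P}_\alpha \,\|\, \mathbb{P}_\beta) \leq \frac{8 C^2 b_0 n}{m^4} = 128 n\delta^2$. Finally, we must ensure that

$$\log M = \frac{m}{10} \geq 2\bigl\{ 128 n\delta^2 + \log 2 \bigr\} = 2\Bigl\{ \frac{8 C^2 b_0 n}{m^4} + \log 2 \Bigr\}.$$

This inequality holds if we choose $m = \frac{n^{1/5}}{C'}$ for a sufficiently small constant $C' > 0$. With this choice, we have $\delta^2 \asymp m^{-4} \asymp n^{-4/5}$, and hence conclude that

$$\inf_{\hat{f}} \sup_{f \in \mathscr{F}_2} \mathbb{E}\bigl[ H^2(\hat{f} \,\|\, f) \bigr] \gtrsim n^{-4/5}.$$

This rate is minimax-optimal for densities with two orders of smoothness; recall that we encountered the same rate for the closely related problem of nonparametric regression in Chapter 13. ♣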
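The packing bound used in this example rests on the numerical fact that the binary KL divergence $D(\frac{1}{4} \,\|\, \frac{1}{2})$ is at least $\frac{1}{10}$. A minimal check follows; the function names are illustrative, and natural logarithms are assumed throughout, consistent with the cardinality $e^{m/10}$.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def packing_log_cardinality(m: int) -> float:
    """Lower bound m * D(1/4 || 1/2) on the log of the 1/4-packing number of the hypercube."""
    return m * bernoulli_kl(0.25, 0.5)


if __name__ == "__main__":
    print(bernoulli_kl(0.25, 0.5))          # approximately 0.1308
    print(packing_log_cardinality(50), 50 / 10)
```

The slack between $0.1308$ and $0.1$ is why the simpler bound $\log M \geq m/10$ can be used in place of the exact divergence.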


As a third example, let us return to the high-dimensional parametric setting, and study minimax risks for the problem of sparse linear regression, which we studied in detail in Chapter 7.

Example 15.16 (Minimax risk for sparse linear regression) Consider the high-dimensional linear regression model $y = X\theta^* + w$, where the regression vector $\theta^*$ is known a priori to be sparse, say with at most $s < d$ non-zero coefficients. It is then natural to consider the minimax risk over the set

$$\mathbb{S}^d(s) := \mathbb{B}^d_0(s) \cap \mathbb{B}_2(1) = \bigl\{ \theta \in \mathbb{R}^d \;\big|\; \|\theta\|_0 \leq s,\ \|\theta\|_2 \leq 1 \bigr\} \tag{15.40}$$

of $s$-sparse vectors within the Euclidean unit ball.

Let us first construct a 1/2-packing of the set $\mathbb{S}^d(s)$. From our earlier results in Chapter 5 (in particular, see Exercise 5.8), there exists a 1/2-packing of this set with log cardinality at least $\log M \geq \frac{s}{2}\log\frac{d - s}{s}$. We follow the same rescaling procedure as in Example 15.14 to form a $2\delta$-packing such that $\|\theta^j - \theta^k\|_2 \leq 4\delta$ for all pairs of vectors in our packing set. Since the vector $\theta^j - \theta^k$ is at most $2s$-sparse, we have

$$\sqrt{D(\mathbb{P}_{\theta^j} \,\|\, \mathbb{P}_{\theta^k})} = \frac{1}{\sqrt{2}\,\sigma}\|X(\theta^j - \theta^k)\|_2 \leq \frac{\gamma_{2s}}{\sqrt{2}\,\sigma}\, 4\delta\sqrt{n},$$

where $\gamma_{2s} := \max_{|T| = 2s} \sigma_{\max}(X_T)/\sqrt{n}$. Putting together the pieces, we see that the minimax risk is lower bounded by any $\delta > 0$ for which

$$\frac{s}{2}\log\frac{d - s}{s} \geq 128\,\frac{\gamma_{2s}^2}{\sigma^2}\, n\delta^2 + 2\log 2.$$

As long as $s \leq d/2$ and $s \geq 10$, the choice $\delta^2 = \frac{\sigma^2}{400\gamma_{2s}^2}\,\frac{s\log\frac{d - s}{s}}{n}$ suffices. Putting together the pieces, we conclude that in the range $10 \leq s \leq d/2$, the minimax risk is lower bounded as

$$\mathfrak{M}\bigl(\mathbb{S}^d(s); \|\cdot\|_2^2\bigr) \gtrsim \frac{\sigma^2}{\gamma_{2s}^2}\,\frac{s\log\frac{ed}{s}}{n}. \tag{15.41}$$

The constant obtained by this argument is not sharp, but this lower bound is otherwise unimprovable: see the bibliographic section for further details. ♣


15.3.4 Local packings with Gaussian entropy bounds 


Our previous examples have all used the convexity-based upper bound (15.34) on the mutual information. We now turn to a different upper bound on the mutual information, applicable when the conditional distribution of $Z$ given $J$ is Gaussian.

Lemma 15.17 Suppose $J$ is uniformly distributed over $[M] = \{1, \ldots, M\}$ and that $Z$ conditioned on $J = j$ has a Gaussian distribution with covariance $\Sigma^j$. Then the mutual information is upper bounded as

$$I(Z; J) \leq \frac{1}{2}\Bigl\{ \log\det\mathrm{cov}(Z) - \frac{1}{M}\sum_{j=1}^M \log\det(\Sigma^j) \Bigr\}. \tag{15.42}$$

This upper bound is a consequence of the maximum entropy property of the multivariate Gaussian distribution; see Exercise 15.14 for further details. In the special case when $\Sigma^j = \Sigma$ for all $j \in [M]$, it takes on the simpler form

$$I(Z; J) \leq \frac{1}{2}\log\Bigl( \frac{\det\mathrm{cov}(Z)}{\det(\Sigma)} \Bigr). \tag{15.43}$$

Let us illustrate the use of these bounds with some examples.
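In one dimension, the bound (15.43) can be compared with the exact mutual information of a Gaussian location mixture. The sketch below is an illustration under explicit assumptions: $Z \mid J = j \sim N(\mu_j, \sigma^2)$ with $J$ uniform, the mutual information is computed as the differential entropy of the mixture minus $\frac{1}{2}\log(2\pi e \sigma^2)$, and the entropy integral is approximated by a midpoint rule on a truncated range. All names are hypothetical.

```python
import math


def gaussian_mixture_information(means, sigma: float, steps: int = 20000):
    """Return (I(Z; J), bound) for Z | J=j ~ N(means[j], sigma^2), J uniform; bound is (15.43)."""
    M = len(means)
    lower, upper = min(means) - 10.0 * sigma, max(means) + 10.0 * sigma
    width = (upper - lower) / steps
    norm = M * sigma * math.sqrt(2.0 * math.pi)
    entropy = 0.0
    for i in range(steps):
        z = lower + (i + 0.5) * width
        density = sum(math.exp(-((z - mu) ** 2) / (2.0 * sigma**2)) for mu in means) / norm
        if density > 0.0:
            entropy -= density * math.log(density) * width
    conditional_entropy = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)
    mutual_information = entropy - conditional_entropy
    mean = sum(means) / M
    variance = sigma**2 + sum((mu - mean) ** 2 for mu in means) / M   # var(Z)
    bound = 0.5 * math.log(variance / sigma**2)
    return mutual_information, bound


if __name__ == "__main__":
    print(gaussian_mixture_information([-1.0, 1.0], sigma=1.0))
```

The gap between the two numbers is the amount by which the mixture falls short of the Gaussian maximum entropy at the same variance.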


Example 15.18 (Variable selection in sparse linear regression) Let us return to the model of sparse linear regression from Example 15.16, based on the standard linear model $y = X\theta^* + w$, where the unknown regression vector $\theta^* \in \mathbb{R}^d$ is $s$-sparse. Here we consider the problem of lower bounding the minimax risk for the problem of variable selection, namely determining the support set $S = \{j \in \{1, 2, \ldots, d\} \mid \theta^*_j \neq 0\}$, which is assumed to have cardinality $s \ll d$.

In this case, the problem of interest is itself a multiway hypothesis test, namely that of choosing from all $\binom{d}{s}$ possible subsets. Consequently, a direct application of Fano's inequality leads to lower bounds, and we can obtain different such bounds by constructing various ensembles of subproblems. These subproblems are parameterized by the pair $(d, s)$, as well as the quantity $\theta_{\min} = \min_{j \in S} |\theta^*_j|$. In this example, we show that, in order to achieve a probability of error below 1/2, any method requires a sample size of at least

$$n > \max\Biggl\{ \frac{\log\binom{d}{s}}{2\log\bigl(1 + \frac{s\theta_{\min}^2}{\sigma^2}\bigr)},\ \frac{\log(d - s + 1)}{2\log\bigl(1 + \frac{\theta_{\min}^2}{\sigma^2}\bigr)} \Biggr\}, \tag{15.44}$$

as long as $\min\bigl\{ \log(d - s + 1), \log\binom{d}{s} \bigr\} \geq 4\log 2$.

For this problem, our observations consist of the response vector $y \in \mathbb{R}^n$ and design matrix $X \in \mathbb{R}^{n \times d}$. We derive lower bounds by first conditioning on a particular instantiation $X = \{x_i\}_{i=1}^n$ of the design matrix, and using a form of Fano's inequality that involves the mutual information $I_X(y; J)$ between the response vector $y$ and the random index $J$ with the design matrix $X$ held fixed. In particular, we have

$$\mathbb{P}\bigl[ \psi(y, X) \neq J \mid X = \{x_i\}_{i=1}^n \bigr] \geq 1 - \frac{I_X(y; J) + \log 2}{\log M},$$

so that by taking averages over $X$, we can obtain lower bounds on $\mathbb{P}[\psi(y, X) \neq J]$ that involve the quantity $\mathbb{E}_X[I_X(y; J)]$.


Ensemble A: Consider the class $M = \binom{d}{s}$ of all possible subsets of cardinality $s$, enumerated in some fixed way. For the $\ell$th subset $S^\ell$, let $\theta^\ell \in \mathbb{R}^d$ have values $\theta_{\min}$ for all indices $j \in S^\ell$, and zeros in all other positions. For a fixed covariate vector $x_i \in \mathbb{R}^d$, an observed response $y_i \in \mathbb{R}$ then follows the mixture distribution $\frac{1}{M} \sum_{\ell=1}^M \mathbb{P}_{\theta^\ell}$, where $\mathbb{P}_{\theta^\ell}$ is the distribution of a $N(\langle x_i, \theta^\ell \rangle, \sigma^2)$ random variable.
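By the definition of mutual information, we have

$$
\begin{aligned}
I_X(y; J) &= H_X(y) - H_X(y \mid J) \\
&\overset{(i)}{\le} \Big\{ \sum_{i=1}^n H_X(y_i) \Big\} - H_X(y \mid J) \\
&\overset{(ii)}{=} \sum_{i=1}^n \big\{ H_X(y_i) - H_X(y_i \mid J) \big\} \\
&= \sum_{i=1}^n I_X(y_i; J),
\end{aligned} \tag{15.45}
$$

where step (i) follows since independent random vectors have larger entropy than dependent ones (see Exercise 15.4), and step (ii) follows since $(y_1, \ldots, y_n)$ are independent conditioned on $J$. Next, applying Lemma 15.17 repeatedly for each $i \in [n]$ with $Z = y_i$, conditionally on the matrix $X$ of covariates, yields

$$ I_X(y; J) \le \frac{1}{2} \sum_{i=1}^n \log \frac{\operatorname{var}(y_i \mid x_i)}{\sigma^2}. $$

Now taking averages over $X$ and using the fact that the pairs $(y_i, x_i)$ are jointly i.i.d., we find that

$$ \mathbb{E}_X\big[ I_X(y; J) \big] \le \frac{n}{2} \, \mathbb{E}_{x_1}\Big[ \log \frac{\operatorname{var}(y_1 \mid x_1)}{\sigma^2} \Big] \le \frac{n}{2} \log \frac{\mathbb{E}_{x_1}[\operatorname{var}(y_1 \mid x_1)]}{\sigma^2}, $$

where the last inequality follows from Jensen's inequality and the concavity of the logarithm.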
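It remains to upper bound the variance term. Since the random vector $y_1$ follows a mixture distribution with $M$ components, we have

$$ \mathbb{E}_{x_1}\big[ \operatorname{var}(y_1 \mid x_1) \big] \le \mathbb{E}_{x_1}\big[ \mathbb{E}[y_1^2 \mid x_1] \big] = \mathbb{E}_{x_1}\Big[ x_1^{\mathrm{T}} \Big\{ \frac{1}{M} \sum_{j=1}^M \theta^j \otimes \theta^j \Big\} x_1 + \sigma^2 \Big] = \operatorname{trace}\Big( \frac{1}{M} \sum_{j=1}^M (\theta^j \otimes \theta^j) \Big) + \sigma^2. $$

Now each index $j \in \{1, 2, \ldots, d\}$ appears in $\binom{d-1}{s-1}$ of the total number of subsets $M = \binom{d}{s}$, so that

$$ \operatorname{trace}\Big( \frac{1}{M} \sum_{j=1}^M \theta^j \otimes \theta^j \Big) = d \, \frac{\binom{d-1}{s-1}}{\binom{d}{s}} \, \theta_{\min}^2 = s \theta_{\min}^2. $$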

Putting together the pieces, we conclude that

$$ \mathbb{E}_X\big[ I_X(y; J) \big] \le \frac{n}{2} \log \Big( 1 + \frac{s \theta_{\min}^2}{\sigma^2} \Big), $$

and hence the Fano lower bound implies that

$$ \mathbb{P}[\psi(y, X) \neq J] \ge 1 - \frac{\frac{n}{2} \log \big( 1 + \frac{s \theta_{\min}^2}{\sigma^2} \big) + \log 2}{\log \binom{d}{s}}, $$

from which the first lower bound in equation (15.44) follows as long as $\log \binom{d}{s} \ge 4 \log 2$, as assumed.


Ensemble B: Let $\theta \in \mathbb{R}^d$ be a vector with $\theta_{\min}$ in its first $s - 1$ coordinates, and zero in all remaining $d - s + 1$ coordinates. For each $j = 1, \ldots, d$, let $e_j \in \mathbb{R}^d$ denote the $j$th standard basis vector with a single one in position $j$. Define the family of $M = d - s + 1$ vectors $\theta^j := \theta + \theta_{\min} e_j$ for $j = s, \ldots, d$. By a straightforward calculation, we have $\mathbb{E}[y \mid x] = \langle x, \gamma \rangle$, where $\gamma := \theta + \frac{1}{M} \theta_{\min} e_{s \to d}$ and the vector $e_{s \to d} \in \mathbb{R}^d$ has ones in positions $s$ through $d$, and zeros elsewhere. By the same argument as for ensemble A, it suffices to upper bound the quantity $\mathbb{E}_{x_1}[\operatorname{var}(y_1 \mid x_1)]$. Using the definition of our ensemble, we have

$$ \mathbb{E}_{x_1}\big[ \operatorname{var}(y_1 \mid x_1) \big] = \sigma^2 + \operatorname{trace}\Big\{ \frac{1}{M} \sum_{j=s}^{d} \big( \theta^j \otimes \theta^j - \gamma \otimes \gamma \big) \Big\} \le \sigma^2 + \theta_{\min}^2. \tag{15.46} $$

Recall that we have assumed that $\log(d - s + 1) \ge 4 \log 2$. Using Fano's inequality and the upper bound (15.46), the second term in the lower bound (15.44) then follows. ♣
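As a rough numerical companion to this example, the following Python sketch evaluates the Fano lower bound on the error probability for the two ensembles, using the mutual information bounds $\frac{n}{2}\log(1 + s\theta_{\min}^2/\sigma^2)$ (ensemble A) and $\frac{n}{2}\log(1 + \theta_{\min}^2/\sigma^2)$ (ensemble B). The function name and the parameter values are illustrative assumptions, not part of the text.

```python
import math


def fano_support_recovery_bound(n, d, s, theta_min, sigma, ensemble="A"):
    """Fano lower bound on the error probability of support recovery.

    Ensemble A: M = C(d, s) hypotheses, signal variance s * theta_min^2.
    Ensemble B: M = d - s + 1 hypotheses, signal variance theta_min^2.
    (Hypothetical helper written for illustration only.)
    """
    if ensemble == "A":
        log_m = math.log(math.comb(d, s))
        signal = s * theta_min ** 2
    elif ensemble == "B":
        log_m = math.log(d - s + 1)
        signal = theta_min ** 2
    else:
        raise ValueError("ensemble must be 'A' or 'B'")
    # Averaged mutual information bound: (n / 2) * log(1 + signal / sigma^2).
    mutual_info = 0.5 * n * math.log1p(signal / sigma ** 2)
    # Fano: P[error] >= 1 - (I + log 2) / log M; negative values carry no information.
    return max(0.0, 1.0 - (mutual_info + math.log(2)) / log_m)


if __name__ == "__main__":
    for n in (0, 10, 50, 200):
        a = fano_support_recovery_bound(n, d=1000, s=10, theta_min=0.5, sigma=1.0)
        b = fano_support_recovery_bound(n, 1000, 10, 0.5, 1.0, ensemble="B")
        print(n, round(a, 3), round(b, 3))
```

The bound decreases linearly in $n$, so the sample size at which it crosses $1/2$ gives the threshold below which reliable support recovery is impossible for that ensemble.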


Let us now turn to a slightly different problem, namely that of lower bounds for principal component analysis. Recall from Chapter 8 the spiked covariance ensemble, in which a random vector $x \in \mathbb{R}^d$ is generated via

$$ x \overset{d}{=} \sqrt{\nu} \, \xi \, \theta^* + w. \tag{15.47} $$

Here $\nu > 0$ is a given signal-to-noise ratio, $\theta^*$ is a fixed vector with unit Euclidean norm, and the random quantities $\xi \sim N(0, 1)$ and $w \sim N(0, I_d)$ are independent. Observe that the $d$-dimensional random vector $x$ is zero-mean Gaussian with a covariance matrix of the form $\Sigma := I_d + \nu (\theta^* \otimes \theta^*)$. Moreover, by construction, the vector $\theta^*$ is the unique maximal eigenvector of the covariance matrix $\Sigma$.

Suppose that our goal is to estimate $\theta^*$ based on $n$ i.i.d. samples of the random vector $x$. In the following example, we derive lower bounds on the minimax risk in the squared Euclidean norm $\|\hat{\theta} - \theta^*\|_2^2$. (As discussed in Chapter 8, recall that there is always a sign ambiguity in estimating eigenvectors, so that in computing the Euclidean norm, we implicitly assume that the correct direction is chosen.)


Example 15.19 (Lower bounds for PCA) Let $\{\Delta^1, \ldots, \Delta^M\}$ be a $1/2$-packing of the unit sphere in $\mathbb{R}^{d-1}$; from Example 5.8, for all $d \ge 3$, there exists such a set with cardinality $\log M \ge (d - 1) \log 2 \ge d/2$. For a given orthonormal matrix $U \in \mathbb{R}^{(d-1) \times (d-1)}$ and tolerance $\delta \in (0, 1)$ to be chosen, consider the family of vectors

$$ \theta^j(U) = \sqrt{1 - \delta^2} \begin{bmatrix} 1 \\ 0_{d-1} \end{bmatrix} + \delta \begin{bmatrix} 0 \\ U \Delta^j \end{bmatrix} \qquad \text{for } j \in [M], \tag{15.48} $$

where $0_{d-1}$ denotes the $(d - 1)$-dimensional vector of zeros. By construction, each vector $\theta^j(U)$ lies on the unit sphere in $\mathbb{R}^d$, and the collection of all $M$ vectors forms a $\delta/2$-packing set. Consequently, we can lower bound the minimax risk by constructing a testing problem based on the family of vectors (15.48). In fact, so as to make the calculations clean, we construct one testing problem for each choice of orthonormal matrix $U$, and then take averages over a randomly chosen matrix.
Let $\mathbb{P}_{\theta^j(U)}$ denote the distribution of a random vector from the spiked ensemble (15.47) with leading eigenvector $\theta^* := \theta^j(U)$. By construction, it is a zero-mean Gaussian random vector with covariance matrix

$$ \Sigma^j(U) := I_d + \nu \big( \theta^j(U) \otimes \theta^j(U) \big). $$

Now for a fixed $U$, suppose that we choose an index $J \in [M]$ uniformly at random, and then draw $n$ i.i.d. samples from the distribution $\mathbb{P}_{\theta^J(U)}$. Letting $Z_1^n(U)$ denote the samples thus obtained, Fano's inequality then implies that the testing error is lower bounded as

$$ \mathbb{P}\big[ \psi(Z_1^n(U)) \neq J \mid U \big] \ge 1 - \frac{I(Z_1^n(U); J) + \log 2}{d/2}, \tag{15.49} $$

where we have used the fact that $\log M \ge d/2$. For each fixed $U$, the samples $Z_1^n(U)$ are conditionally independent given $J$. Consequently, following the same line of reasoning leading to equation (15.45), we can conclude that $I(Z_1^n(U); J) \le n I(Z(U); J)$, where $Z(U)$ denotes a single sample.

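Since the lower bound (15.49) holds for each fixed choice of orthonormal matrix $U$, we can take averages when $U$ is chosen uniformly at random. Doing so simplifies the task of bounding the mutual information, since we need only bound the averaged mutual information $\mathbb{E}_U[I(Z(U); J)]$. Since $\det(\Sigma^j(U)) = 1 + \nu$ for each $j \in [M]$, Lemma 15.17 implies that

$$
\begin{aligned}
\mathbb{E}_U\big[ I(Z(U); J) \big] &\le \frac{1}{2} \Big\{ \mathbb{E}_U \log \det(\operatorname{cov}(Z(U))) - \log(1 + \nu) \Big\} \\
&\le \frac{1}{2} \Big\{ \log \det \underbrace{\mathbb{E}_U(\operatorname{cov}(Z(U)))}_{:= \Gamma} - \log(1 + \nu) \Big\},
\end{aligned} \tag{15.50}
$$

where the second step uses the concavity of the log-determinant function, and Jensen's inequality. Let us now compute the entries of the expected covariance matrix $\Gamma$. It can be seen that $\Gamma_{11} = 1 + \nu - \nu \delta^2$; moreover, using the fact that $U \Delta^j$ is uniformly distributed over the unit sphere in dimension $(d - 1)$, the first column is equal to

$$ \Gamma_{(2 \to d), 1} = \nu \delta \sqrt{1 - \delta^2} \; \frac{1}{M} \sum_{j=1}^M \mathbb{E}_U[U \Delta^j] = 0. $$

Letting $\Gamma_{\text{low}}$ denote the lower square block of side length $(d - 1)$, we have

$$ \Gamma_{\text{low}} = I_{d-1} + \frac{\delta^2 \nu}{M} \sum_{j=1}^M \mathbb{E}\big[ (U \Delta^j) \otimes (U \Delta^j) \big] = \Big( 1 + \frac{\delta^2 \nu}{d - 1} \Big) I_{d-1}, $$

again using the fact that the random vector $U \Delta^j$ is uniformly distributed over the sphere in dimension $d - 1$. Putting together the pieces, we have shown that $\Gamma = \operatorname{blkdiag}(\Gamma_{11}, \Gamma_{\text{low}})$, and hence

$$ \log \det \Gamma = (d - 1) \log \Big( 1 + \frac{\nu \delta^2}{d - 1} \Big) + \log \big( 1 + \nu - \nu \delta^2 \big). $$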


Combining our earlier bound (15.50) with the elementary inequality $\log(1 + t) \le t$, we find that

$$
\begin{aligned}
2 \, \mathbb{E}_U\big[ I(Z(U); J) \big] &\le (d - 1) \log \Big( 1 + \frac{\nu \delta^2}{d - 1} \Big) + \log \Big( 1 - \frac{\nu}{1 + \nu} \delta^2 \Big) \\
&\le \Big\{ \nu - \frac{\nu}{1 + \nu} \Big\} \delta^2 \\
&= \frac{\nu^2}{1 + \nu} \delta^2.
\end{aligned}
$$

Taking averages over our earlier Fano bound (15.49) and using this upper bound on the averaged mutual information, we find that the minimax risk for estimating the spiked eigenvector in squared Euclidean norm is lower bounded as

$$ \mathfrak{M}(\mathrm{PCA}; \mathbb{S}^{d-1}, \| \cdot \|_2^2) \succsim \min \Big\{ \frac{1 + \nu}{\nu^2} \frac{d}{n}, \; 1 \Big\}. $$

In Corollary 8.7, we proved that the maximum eigenvector of the sample covariance achieves this squared Euclidean error up to constant pre-factors, so that we have obtained a sharp characterization of the minimax risk. ♣
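The chain of inequalities in this example can be checked numerically. The sketch below, with hypothetical function names, compares the log-determinant expression for $2\,\mathbb{E}_U[I(Z(U); J)]$ with the simplified bound $\nu^2\delta^2/(1+\nu)$, and then plugs the simplified bound into the Fano inequality (15.49) using $I(Z_1^n(U); J) \le n I(Z(U); J)$.

```python
import math


def pca_mutual_info_bounds(d, nu, delta):
    """Return (logdet_bound, simplified_bound) for 2 * E_U[I(Z(U); J)]."""
    # log det(Gamma) - log(1 + nu), using Gamma = blkdiag(Gamma_11, Gamma_low).
    logdet_bound = (d - 1) * math.log1p(nu * delta ** 2 / (d - 1)) + math.log1p(
        -nu * delta ** 2 / (1.0 + nu)
    )
    # Result of applying log(1 + t) <= t to both terms.
    simplified_bound = nu ** 2 * delta ** 2 / (1.0 + nu)
    return logdet_bound, simplified_bound


def pca_fano_bound(n, d, nu, delta):
    """Averaged Fano bound on the testing error, with log M replaced by d / 2."""
    _, simplified = pca_mutual_info_bounds(d, nu, delta)
    mutual_info = 0.5 * n * simplified  # n samples, each contributing simplified / 2
    return max(0.0, 1.0 - (mutual_info + math.log(2)) / (d / 2.0))


if __name__ == "__main__":
    print(pca_mutual_info_bounds(d=50, nu=1.0, delta=0.3))
    print(pca_fano_bound(n=100, d=50, nu=1.0, delta=0.3))
```

Choosing $\delta^2$ proportional to $\frac{1+\nu}{\nu^2}\frac{d}{n}$ keeps the mutual information term a small fraction of $d/2$, which is how the stated rate arises.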


As a follow-up to the previous example, we now turn to the sparse variant of princi- 
pal components analysis. As discussed in Chapter 8, there are a number of motivations for 
studying sparsity in PCA, including the fact that it allows eigenvectors to be estimated at 
substantially faster rates. Accordingly, let us now prove some lower bounds for variable 
selection in sparse PCA, again working under the spiked model (15.47). 


Example 15.20 (Lower bounds for variable selection in sparse PCA) Suppose that our goal is to determine the scaling of the sample size required to ensure that the support set of an $s$-sparse eigenvector $\theta^*$ can be recovered. Of course, the difficulty of the problem depends on the minimum value $\theta_{\min} = \min_{j \in S} |\theta^*_j|$. Here we show that if $\theta_{\min} \asymp \frac{1}{\sqrt{s}}$, then any method requires $n \succsim \frac{1 + \nu}{\nu^2} \, s \log(d - s + 1)$ samples to correctly recover the support. In Exercise 15.15, we prove a more general lower bound for arbitrary scalings of $\theta_{\min}$.

Recall our analysis of variable selection in sparse linear regression from Example 15.18: here we use an approach similar to ensemble B from that example. In particular, fix a subset $S$ of size $s - 1$, and let $\varepsilon \in \{-1, 1\}^d$ be a vector of sign variables. For each $j \in S^c := [d] \setminus S$, we then define the vector

$$ [\theta^j(\varepsilon)]_\ell = \begin{cases} \frac{1}{\sqrt{s}} & \text{if } \ell \in S, \\ \frac{\varepsilon_j}{\sqrt{s}} & \text{if } \ell = j, \\ 0 & \text{otherwise.} \end{cases} $$

In Example 15.19, we computed averages over a randomly chosen orthonormal matrix $U$; here instead we average over the choice of random sign vectors $\varepsilon$.
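Let $\mathbb{P}_{\theta^j(\varepsilon)}$ denote the distribution of the spiked vector (15.47) with $\theta^* = \theta^j(\varepsilon)$, and let $Z(\varepsilon)$ be a sample from the mixture distribution $\frac{1}{M} \sum_{j \in S^c} \mathbb{P}_{\theta^j(\varepsilon)}$. Following a similar line of calculation as Example 15.19, we have

$$ \mathbb{E}_\varepsilon\big[ I(Z(\varepsilon); J) \big] \le \frac{1}{2} \Big\{ \log \det(\Gamma) - \log(1 + \nu) \Big\}, $$

where $\Gamma := \mathbb{E}_\varepsilon[\operatorname{cov}(Z(\varepsilon))]$ is the averaged covariance matrix, taken over the uniform distribution over all Rademacher vectors. Letting $E_{s-1}$ denote a square matrix of all ones with side length $s - 1$, a straightforward calculation yields that $\Gamma$ is a block diagonal matrix with $\Gamma_{SS} = I_{s-1} + \frac{\nu}{s} E_{s-1}$ and $\Gamma_{S^c S^c} = \big( 1 + \frac{\nu}{s(d - s + 1)} \big) I_{d-s+1}$. Consequently, we have

$$
\begin{aligned}
2 \, \mathbb{E}_\varepsilon\big[ I(Z(\varepsilon); J) \big] &\le \log \Big( 1 + \nu \frac{s - 1}{s} \Big) + (d - s + 1) \log \Big( 1 + \frac{\nu}{s(d - s + 1)} \Big) - \log(1 + \nu) \\
&= \log \Big( 1 - \frac{\nu}{1 + \nu} \frac{1}{s} \Big) + (d - s + 1) \log \Big( 1 + \frac{\nu}{s(d - s + 1)} \Big) \\
&\le \frac{1}{s} \Big\{ -\frac{\nu}{1 + \nu} + \nu \Big\} \\
&= \frac{1}{s} \frac{\nu^2}{1 + \nu}.
\end{aligned}
$$

Recalling that we have $n$ samples and that $\log M = \log(d - s + 1)$, Fano's inequality implies that the probability of error is bounded away from zero as long as the ratio

$$ \frac{n}{s \log(d - s + 1)} \, \frac{\nu^2}{1 + \nu} $$

is upper bounded by a sufficiently small but universal constant, as claimed. ♣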
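The block-diagonal form of the averaged covariance in this example is easy to confirm by brute force. The following sketch (pure Python, with hypothetical helper names, and with $S$ taken to be the first $s - 1$ coordinates as an arbitrary choice) averages $I_d + \nu\,\theta^j(\varepsilon) \otimes \theta^j(\varepsilon)$ over all $j \in S^c$ and both signs $\varepsilon_j \in \{-1, +1\}$, and compares the result with the claimed blocks.

```python
import math


def averaged_covariance(d, s, nu):
    """Average of I + nu * theta theta^T over j in S^c and the sign of eps_j."""
    support = range(s - 1)            # S: first s - 1 coordinates (illustrative choice)
    complement = range(s - 1, d)      # S^c: remaining d - s + 1 coordinates
    gamma = [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]
    weight = nu / (2.0 * len(complement))
    for j in complement:
        for eps in (-1.0, 1.0):
            theta = [0.0] * d
            for ell in support:
                theta[ell] = 1.0 / math.sqrt(s)
            theta[j] = eps / math.sqrt(s)
            for a in range(d):
                for b in range(d):
                    gamma[a][b] += weight * theta[a] * theta[b]
    return gamma


def claimed_blocks(d, s, nu):
    """Gamma_SS = I + (nu / s) * ones, Gamma_ScSc = (1 + nu / (s (d - s + 1))) * I."""
    m = d - s + 1
    gamma = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            if a < s - 1 and b < s - 1:
                gamma[a][b] = (1.0 if a == b else 0.0) + nu / s
            elif a == b:
                gamma[a][b] = 1.0 + nu / (s * m)
    return gamma


def sparse_pca_mi_bounds(d, s, nu):
    """Return (logdet_bound, simplified_bound) for 2 * E_eps[I(Z(eps); J)]."""
    m = d - s + 1
    logdet = math.log1p(nu * (s - 1) / s) + m * math.log1p(nu / (s * m)) - math.log1p(nu)
    return logdet, nu ** 2 / (s * (1.0 + nu))
```

The cross terms between $S$ and $S^c$ vanish because each sign $\varepsilon_j$ has mean zero, which is the role of the Rademacher averaging.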


15.3.5 Yang-Barron version of Fano’s method 


Our analysis thus far has been based on relatively naive upper bounds on the mutual infor- 
mation. These upper bounds are useful whenever we are able to construct a local packing of 
the parameter space, as we have done in the preceding examples. In this section, we develop 
an alternative upper bound on the mutual information. It is particularly useful for nonpara- 
metric problems, since it obviates the need for constructing a local packing. 


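Lemma 15.21 (Yang–Barron method) Let $N_{\mathrm{KL}}(\epsilon; \mathcal{P})$ denote the $\epsilon$-covering number of $\mathcal{P}$ in the square-root KL divergence. Then the mutual information is upper bounded as

$$ I(Z; J) \le \inf_{\epsilon > 0} \big\{ \epsilon^2 + \log N_{\mathrm{KL}}(\epsilon; \mathcal{P}) \big\}. \tag{15.51} $$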


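Proof Recalling the form (15.30) of the mutual information, we observe that for any distribution $Q$, the mutual information is upper bounded by

$$ I(Z; J) = \frac{1}{M} \sum_{j=1}^M D(\mathbb{P}_{\theta^j} \,\|\, \bar{Q}) \overset{(i)}{\le} \frac{1}{M} \sum_{j=1}^M D(\mathbb{P}_{\theta^j} \,\|\, Q) \le \max_{j = 1, \ldots, M} D(\mathbb{P}_{\theta^j} \,\|\, Q), \tag{15.52} $$

where inequality (i) uses the fact that the mixture distribution $\bar{Q} := \frac{1}{M} \sum_{j=1}^M \mathbb{P}_{\theta^j}$ minimizes the average Kullback–Leibler divergence over the family $\{\mathbb{P}_{\theta^1}, \ldots, \mathbb{P}_{\theta^M}\}$; see Exercise 15.11 for details.

Since the upper bound (15.52) holds for any distribution $Q$, we are free to choose it: in particular, we let $\{\gamma^1, \ldots, \gamma^N\}$ be an $\epsilon$-covering of $\Omega$ in the square-root KL pseudo-distance, and then set $Q = \frac{1}{N} \sum_{k=1}^N \mathbb{P}_{\gamma^k}$. By construction, for each $\theta^j$ with $j \in [M]$, we can find some $\gamma^k$ such that $D(\mathbb{P}_{\theta^j} \,\|\, \mathbb{P}_{\gamma^k}) \le \epsilon^2$. Therefore, we have

$$
\begin{aligned}
D(\mathbb{P}_{\theta^j} \,\|\, Q) &= \mathbb{E}_{\theta^j}\Big[ \log \frac{d\mathbb{P}_{\theta^j}}{\frac{1}{N} \sum_{\ell=1}^N d\mathbb{P}_{\gamma^\ell}} \Big] \\
&\le \mathbb{E}_{\theta^j}\Big[ \log \frac{d\mathbb{P}_{\theta^j}}{\frac{1}{N} \, d\mathbb{P}_{\gamma^k}} \Big] \\
&= D(\mathbb{P}_{\theta^j} \,\|\, \mathbb{P}_{\gamma^k}) + \log N \\
&\le \epsilon^2 + \log N.
\end{aligned}
$$

Since this bound holds for any choice of $j \in [M]$ and any choice of $\epsilon > 0$, the claim (15.51) follows.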


In conjunction with Proposition 15.12, Lemma 15.21 allows us to prove a minimax lower bound of the order $\delta$ as long as the pair $(\delta, \epsilon) \in \mathbb{R}_+^2$ is chosen such that

$$ \log M(\delta; \rho, \Omega) \ge 2 \big\{ \epsilon^2 + \log N_{\mathrm{KL}}(\epsilon; \mathcal{P}) + \log 2 \big\}. $$

Finding such a pair can be accomplished via a two-step procedure:
(A) First, choose $\epsilon_n > 0$ such that

$$ \epsilon_n^2 \ge \log N_{\mathrm{KL}}(\epsilon_n; \mathcal{P}). \tag{15.53a} $$

Since the KL divergence typically scales with $n$, it is usually the case that $\epsilon_n^2$ also grows with $n$, hence the subscript in our notation.
(B) Second, choose the largest $\delta_n > 0$ that satisfies the lower bound

$$ \log M(\delta_n; \rho, \Omega) \ge 4 \epsilon_n^2 + 2 \log 2. \tag{15.53b} $$


As before, this two-step procedure is best understood by working through some examples. 


Example 15.22 (Density estimation revisited) In order to illustrate the use of the Yang–Barron method, let us return to the problem of density estimation in the Hellinger metric, as previously considered in Example 15.15. Our analysis involved the class $\mathcal{F}_2$, as defined in equation (15.21), of densities on $[0, 1]$, bounded uniformly above, bounded uniformly away from zero, and with uniformly bounded second derivative. Using the local form of Fano's method, we proved that the minimax risk in squared Hellinger distance is lower bounded as $n^{-4/5}$. In this example, we recover the same result more directly by using known results about the metric entropy.

For uniformly bounded densities on the interval $[0, 1]$, the squared Hellinger metric is sandwiched above and below by constant multiples of the squared $L^2([0, 1])$-norm:

$$ \|p - q\|_2^2 = \int_0^1 \big( p(x) - q(x) \big)^2 \, dx. $$

Moreover, again using the uniform lower bound, the Kullback–Leibler divergence between any pair of distributions in this family is upper bounded by a constant multiple of the squared Hellinger distance, and hence by a constant multiple of the squared Euclidean distance. (See equation (15.39) for a related calculation.) Consequently, in order to apply the Yang–Barron method, we need only understand the scaling of the metric entropy in the $L^2$-norm. From classical theory, it is known that the metric entropy of the class $\mathcal{F}_2$ in $L^2$-norm scales as $\log N(\delta; \mathcal{F}_2, \| \cdot \|_2) \asymp (1/\delta)^{1/2}$ for $\delta > 0$ sufficiently small.


Step A: Given $n$ i.i.d. samples, the square-root Kullback–Leibler divergence is multiplied by a factor of $\sqrt{n}$, so that the inequality (15.53a) can be satisfied by choosing $\epsilon_n > 0$ such that

$$ \epsilon_n^2 \succsim \Big( \frac{\sqrt{n}}{\epsilon_n} \Big)^{1/2}. $$

In particular, the choice $\epsilon_n^2 \asymp n^{1/5}$ is sufficient.

Step B: With this choice of $\epsilon_n$, the second condition (15.53b) can be satisfied by choosing $\delta_n > 0$ such that

$$ \Big( \frac{1}{\delta_n} \Big)^{1/2} \succsim n^{1/5}, $$

or equivalently $\delta_n^2 \asymp n^{-4/5}$. In this way, we have a much more direct re-derivation of the $n^{-4/5}$ lower bound on the minimax risk. ♣
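To see the two steps of this example at work, the sketch below solves them numerically under the simplifying assumption that all constants hidden in the entropy scalings equal one, so that $\log N_{\mathrm{KL}}(\epsilon) = (\sqrt{n}/\epsilon)^{1/2}$ and $\log M(\delta) = (1/\delta)^{1/2}$. The function names are hypothetical, and only the scaling in $n$, not the constants, should be read off the output.

```python
import math


def solve_epsilon_sq(n, tol=1e-12):
    """Step A: smallest eps^2 with eps^2 >= (sqrt(n) / eps)^(1/2), by bisection."""

    def gap(eps):
        # Increasing in eps; the sign change marks the critical radius.
        return eps ** 2 - (math.sqrt(n) / eps) ** 0.5

    lo, hi = 1e-9, 1.0
    while gap(hi) < 0:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi ** 2


def solve_delta_sq(eps_sq):
    """Step B: largest delta with (1 / delta)^(1/2) >= 4 * eps^2 + 2 * log 2."""
    delta = 1.0 / (4.0 * eps_sq + 2.0 * math.log(2)) ** 2
    return delta ** 2


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        eps_sq = solve_epsilon_sq(n)
        print(n, eps_sq / n ** 0.2, solve_delta_sq(eps_sq) * n ** 0.8)
```

Under these assumptions the first ratio is identically one, and the second stays bounded between positive constants, consistent with $\epsilon_n^2 \asymp n^{1/5}$ and $\delta_n^2 \asymp n^{-4/5}$.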


As a second illustration of the Yang–Barron approach, let us now derive some minimax risks for the problem of nonparametric regression, as discussed in Chapter 13. Recall that the standard regression model is based on i.i.d. observations of the form

$$ y_i = f^*(x_i) + \sigma w_i, \qquad \text{for } i = 1, 2, \ldots, n, $$

where $w_i \sim N(0, 1)$. Assuming that the design points $\{x_i\}_{i=1}^n$ are drawn in an i.i.d. fashion from some distribution $\mathbb{P}$, let us derive lower bounds in the $L^2(\mathbb{P})$-norm:

$$ \|\hat{f} - f^*\|_2^2 = \int_{\mathcal{X}} \big[ \hat{f}(x) - f^*(x) \big]^2 \, \mathbb{P}(dx). $$


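Example 15.23 (Minimax risks for generalized Sobolev families) For a smoothness parameter $\alpha > 1/2$, consider the ellipsoid in $\ell^2(\mathbb{N})$ given by

$$ \mathcal{E}_\alpha = \Big\{ (\theta_j)_{j=1}^\infty \;\Big|\; \sum_{j=1}^\infty j^{2\alpha} \theta_j^2 \le 1 \Big\}. \tag{15.54a} $$

Given an orthonormal sequence $(\phi_j)_{j=1}^\infty$ in $L^2(\mathbb{P})$, we can then define the function class

$$ \mathcal{F}_\alpha := \Big\{ f = \sum_{j=1}^\infty \theta_j \phi_j \;\Big|\; (\theta_j)_{j=1}^\infty \in \mathcal{E}_\alpha \Big\}. \tag{15.54b} $$

As discussed in Chapter 12, these function classes can be viewed as particular types of reproducing kernel Hilbert spaces, where $\alpha$ corresponds to the degree of smoothness. For any such function class, we claim that the minimax risk in squared $L^2(\mathbb{P})$-norm is lower bounded as

$$ \inf_{\hat{f}} \sup_{f \in \mathcal{F}_\alpha} \mathbb{E}\|\hat{f} - f\|_2^2 \succsim \min \Big\{ 1, \Big( \frac{\sigma^2}{n} \Big)^{\frac{2\alpha}{2\alpha + 1}} \Big\}, \tag{15.55} $$

and here we prove this claim via the Yang–Barron technique.

Consider a function of the form $f = \sum_{j=1}^\infty \theta_j \phi_j$ for some $\theta \in \ell^2(\mathbb{N})$, and observe that by the orthonormality of $(\phi_j)_{j=1}^\infty$, Parseval's theorem implies that $\|f\|_2^2 = \sum_{j=1}^\infty \theta_j^2$. Consequently, based on our calculations from Example 5.12, the metric entropy of $\mathcal{F}_\alpha$ scales as $\log N(\delta; \mathcal{F}_\alpha, \| \cdot \|_2) \asymp (1/\delta)^{1/\alpha}$. Accordingly, we can find a $\delta$-packing $\{f^1, \ldots, f^M\}$ of $\mathcal{F}_\alpha$ in the $\| \cdot \|_2$-norm with $\log M \succsim (1/\delta)^{1/\alpha}$ elements.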


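Step A: For this part of the calculation, we first need to upper bound the metric entropy in the KL divergence. For each $j \in [M]$, let $\mathbb{P}_{f^j}$ denote the distribution of $y$ given $\{x_i\}_{i=1}^n$ when the true regression function is $f^j$, and let $\mathbb{Q}$ denote the $n$-fold product distribution over the covariates $\{x_i\}_{i=1}^n$. When the true regression function is $f^j$, the joint distribution over $(y, \{x_i\}_{i=1}^n)$ is given by $\mathbb{P}_{f^j} \times \mathbb{Q}$, and hence for any distinct pair of indices $j \neq k$, we have

$$
\begin{aligned}
D(\mathbb{P}_{f^j} \times \mathbb{Q} \,\|\, \mathbb{P}_{f^k} \times \mathbb{Q}) &= \mathbb{E}_x\big[ D(\mathbb{P}_{f^j} \,\|\, \mathbb{P}_{f^k}) \big] = \mathbb{E}_x\Big[ \frac{1}{2\sigma^2} \sum_{i=1}^n \big( f^j(x_i) - f^k(x_i) \big)^2 \Big] \\
&= \frac{n}{2\sigma^2} \|f^j - f^k\|_2^2.
\end{aligned}
$$

Consequently, we find that

$$ \log N_{\mathrm{KL}}(\epsilon) = \log N\Big( \frac{\sigma \sqrt{2} \, \epsilon}{\sqrt{n}}; \mathcal{F}_\alpha, \| \cdot \|_2 \Big) \precsim \Big( \frac{\sqrt{n}}{\sigma \epsilon} \Big)^{1/\alpha}, $$

where the final inequality again uses the result of Example 5.12. Consequently, inequality (15.53a) can be satisfied by setting $\epsilon_n^2 \asymp \big( \frac{n}{\sigma^2} \big)^{\frac{1}{2\alpha + 1}}$.

Step B: It remains to choose $\delta > 0$ to satisfy the inequality (15.53b). Given our choice of $\epsilon_n$ and the scaling of the packing entropy, we require

$$ (1/\delta)^{1/\alpha} \ge c \Big\{ \Big( \frac{n}{\sigma^2} \Big)^{\frac{1}{2\alpha + 1}} + 2 \log 2 \Big\}. \tag{15.56} $$

As long as $n/\sigma^2$ is larger than some universal constant, the choice $\delta_n^2 \asymp \big( \frac{\sigma^2}{n} \big)^{\frac{2\alpha}{2\alpha + 1}}$ satisfies the condition (15.56). Putting together the pieces yields the claim (15.55). ♣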
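The rate in this example can be tabulated with a few lines of Python. The sketch below, with hypothetical function names and all universal constants set to one, returns the order of the lower bound (15.55) together with the Step A quantity $\epsilon_n^2$, and checks that the two choices are mutually consistent in the sense that $(1/\delta_n)^{1/\alpha} = \epsilon_n^2$.

```python
def sobolev_epsilon_sq(n, sigma, alpha):
    """Step A choice: eps_n^2 = (n / sigma^2)^(1 / (2 alpha + 1)), constants dropped."""
    return (n / sigma ** 2) ** (1.0 / (2.0 * alpha + 1.0))


def sobolev_delta_sq(n, sigma, alpha):
    """Step B choice: delta_n^2 = (sigma^2 / n)^(2 alpha / (2 alpha + 1))."""
    return (sigma ** 2 / n) ** (2.0 * alpha / (2.0 * alpha + 1.0))


def sobolev_minimax_rate(n, sigma, alpha):
    """Order of the lower bound (15.55), up to universal constants."""
    if alpha <= 0.5:
        raise ValueError("the smoothness parameter alpha must exceed 1/2")
    return min(1.0, sobolev_delta_sq(n, sigma, alpha))


if __name__ == "__main__":
    for alpha in (1.0, 2.0, 4.0):
        print(alpha, [sobolev_minimax_rate(n, 1.0, alpha) for n in (10, 1000, 100000)])
```

With $\alpha = 2$ and $\sigma = 1$ the rate reduces to $n^{-4/5}$, and larger $\alpha$ pushes the exponent toward the parametric rate $n^{-1}$.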


In the exercises, we explore a number of other applications of the Yang—Barron method. 


15.4 Appendix: Basic background in information theory 


This appendix is devoted to some basic information-theoretic background, including a proof 
of Fano’s inequality. The most fundamental concept is that of the Shannon entropy: it is a 
functional on the space of probability distributions that provides a measure of their dispersion.


Definition 15.24 Let $\mathbb{Q}$ be a probability distribution with density $q = \frac{d\mathbb{Q}}{d\mu}$ with respect to some base measure $\mu$. The Shannon entropy is given by

$$ H(\mathbb{Q}) := -\mathbb{E}[\log q(X)] = -\int_{\mathcal{X}} q(x) \log q(x) \, \mu(dx), \tag{15.57} $$

when this integral is finite.


The simplest form of entropy arises when $\mathbb{Q}$ is supported on a discrete set $\mathcal{X}$, so that $q$ can be taken as a probability mass function, hence a density with respect to the counting measure on $\mathcal{X}$. In this case, the definition (15.57) yields the discrete entropy

$$ H(\mathbb{Q}) = -\sum_{x \in \mathcal{X}} q(x) \log q(x). \tag{15.58} $$
It is easy to check that the discrete entropy is always non-negative. Moreover, when X is 
a finite set, it satisfies the upper bound H(Q) < log |X|, with equality achieved when Q is 
uniform over X. See Exercise 15.2 for further discussion of these basic properties. 
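As a small illustration of the discrete entropy (15.58) and the two properties just mentioned, here is a minimal Python sketch. The function name is hypothetical, entropies are measured in nats, and the convention $0 \log 0 = 0$ is applied by skipping zero-probability atoms.

```python
import math


def discrete_entropy(pmf):
    """Shannon entropy (in nats) of a probability mass function given as a list."""
    if any(p < 0 for p in pmf) or abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("pmf must be non-negative and sum to one")
    # Convention 0 * log 0 = 0: atoms with zero mass are skipped.
    return -sum(p * math.log(p) for p in pmf if p > 0)


if __name__ == "__main__":
    print(discrete_entropy([0.25] * 4), math.log(4))   # uniform attains log |X|
    print(discrete_entropy([0.7, 0.2, 0.1, 0.0]))       # strictly between 0 and log 4
    print(discrete_entropy([1.0, 0.0, 0.0, 0.0]))       # point mass has zero entropy
```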


An important remark on notation is needed before proceeding: Given a random variable $X \sim \mathbb{Q}$, one often writes $H(X)$ in place of $H(\mathbb{Q})$. From a certain point of view, this is an abuse of notation, since the entropy is a functional of the distribution $\mathbb{Q}$ as opposed to the random variable $X$. However, as it is standard practice in information theory, we make use of this convenient notation in this appendix.


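Definition 15.25 Given a pair of random variables $(X, Y)$ with joint distribution $\mathbb{Q}_{X,Y}$, the conditional entropy of $X \mid Y$ is given by

$$ H(X \mid Y) := \mathbb{E}_Y\big[ H(\mathbb{Q}_{X \mid Y}) \big] = \mathbb{E}_Y\Big[ -\int_{\mathcal{X}} q(x \mid Y) \log q(x \mid Y) \, \mu(dx) \Big]. \tag{15.59} $$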


We leave the reader to verify the following elementary properties of entropy and mutual information. First, conditioning can only reduce entropy:

$$ H(X \mid Y) \le H(X). \tag{15.60a} $$

As will be clear below, this inequality is equivalent to the non-negativity of the mutual information $I(X; Y)$. Secondly, the joint entropy can be decomposed into a sum of singleton and conditional entropies as

$$ H(X, Y) = H(Y) + H(X \mid Y). \tag{15.60b} $$

This decomposition is known as the chain rule for entropy. The conditional entropy also satisfies a form of chain rule:

$$ H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z). \tag{15.60c} $$

Finally, it is worth noting the connections between entropy and mutual information. By expanding the definition of mutual information, we see that

$$ I(X; Y) = H(X) + H(Y) - H(X, Y). \tag{15.60d} $$

By replacing the joint entropy with its chain rule decomposition (15.60b), applied with the roles of $X$ and $Y$ interchanged, we obtain

$$ I(X; Y) = H(Y) - H(Y \mid X). \tag{15.60e} $$
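These identities are easy to confirm numerically for a small joint probability mass function. In the sketch below (hypothetical function names, entropies in nats), the conditional entropy is computed directly from the definition (15.59) rather than from the chain rule, so that the chain rule and the mutual information identities serve as genuine checks.

```python
import math


def entropy(pmf):
    """Discrete Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def entropy_summary(joint):
    """joint[x][y] = P[X = x, Y = y]; returns entropies and mutual information."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    # H(X | Y): average over y of the entropy of the conditional law of X given Y = y.
    h_x_given_y = sum(
        py[y] * entropy([joint[x][y] / py[y] for x in range(len(px))])
        for y in range(len(py))
        if py[y] > 0
    )
    h_x, h_y = entropy(px), entropy(py)
    return {
        "H(X)": h_x,
        "H(Y)": h_y,
        "H(X,Y)": h_xy,
        "H(X|Y)": h_x_given_y,
        "I(X;Y)": h_x + h_y - h_xy,
    }


if __name__ == "__main__":
    joint = [[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]]
    for name, value in entropy_summary(joint).items():
        print(name, round(value, 4))
```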


With these results in hand, we are now ready to prove the Fano bound (15.31). We do so by first establishing a slightly more general result. Introducing the shorthand notation $q_e = \mathbb{P}[\psi(Z) \neq J]$, we let $h(q_e) = -q_e \log q_e - (1 - q_e) \log(1 - q_e)$ denote the binary entropy. With this notation, the standard form of Fano's inequality is that the error probability in any $M$-ary testing problem is lower bounded as

$$ h(q_e) + q_e \log(M - 1) \ge H(J \mid Z). \tag{15.61} $$

To see how this lower bound implies the stated claim (15.31), we note that

$$ H(J \mid Z) \overset{(i)}{=} H(J) - I(Z; J) \overset{(ii)}{=} \log M - I(Z; J), $$

where equality (i) follows from the representation of mutual information in terms of entropy, and equality (ii) uses our assumption that $J$ is uniformly distributed over the index set. Since $h(q_e) \le \log 2$, we find that

$$ \log 2 + q_e \log M \ge \log M - I(Z; J), $$

which is equivalent to the claim (15.31).

It remains to prove the lower bound (15.61). Define the $\{0, 1\}$-valued random variable $V := \mathbb{I}[\psi(Z) \neq J]$, and note that $H(V) = h(q_e)$ by construction. We now proceed to expand the conditional entropy $H(V, J \mid Z)$ in two different ways. On one hand, by the chain rule, we have

$$ H(V, J \mid Z) = H(J \mid Z) + H(V \mid J, Z) = H(J \mid Z), \tag{15.62} $$

where the second equality follows since $V$ is a function of $Z$ and $J$. By an alternative application of the chain rule, we have

$$ H(V, J \mid Z) = H(V \mid Z) + H(J \mid V, Z) \le h(q_e) + H(J \mid V, Z), $$

where the inequality follows since conditioning can only reduce entropy. By the definition of conditional entropy, we have

$$ H(J \mid V, Z) = \mathbb{P}[V = 1] \, H(J \mid Z, V = 1) + \mathbb{P}[V = 0] \, H(J \mid Z, V = 0). $$

If $V = 0$, then $J = \psi(Z)$, so that $H(J \mid Z, V = 0) = 0$. On the other hand, if $V = 1$, then we know that $J \neq \psi(Z)$, so that the conditioned random variable $(J \mid Z, V = 1)$ can take at most $M - 1$ values, which implies that

$$ H(J \mid Z, V = 1) \le \log(M - 1), $$

since entropy is maximized by the uniform distribution. Since $\mathbb{P}[V = 1] = q_e$, we have thus shown that

$$ H(V, J \mid Z) \le h(q_e) + q_e \log(M - 1), $$

and combined with the earlier equality (15.62), the claim (15.61) follows.
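Fano's inequality (15.61) can also be checked on a small discrete example. The following sketch assumes a finite observation space, a uniform index $J$, and the maximum a posteriori test as the choice of $\psi$; the function names and the particular channel matrices are illustrative assumptions.

```python
import math


def binary_entropy(q):
    """Binary entropy h(q) in nats, with h(0) = h(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)


def fano_check(channel):
    """channel[j][z] = P[Z = z | J = j], with J uniform over M >= 2 hypotheses.

    Returns (lhs, rhs) where lhs = h(q_e) + q_e log(M - 1) for the MAP test
    and rhs = H(J | Z); inequality (15.61) asserts lhs >= rhs.
    """
    m = len(channel)
    num_z = len(channel[0])
    pz = [sum(channel[j][z] for j in range(m)) / m for z in range(num_z)]
    h_j_given_z = 0.0
    prob_correct = 0.0
    for z in range(num_z):
        if pz[z] == 0:
            continue
        posterior = [channel[j][z] / (m * pz[z]) for j in range(m)]
        h_j_given_z += pz[z] * -sum(p * math.log(p) for p in posterior if p > 0)
        prob_correct += pz[z] * max(posterior)  # MAP test picks the likeliest index
    q_e = min(1.0, max(0.0, 1.0 - prob_correct))
    return binary_entropy(q_e) + q_e * math.log(m - 1), h_j_given_z


if __name__ == "__main__":
    print(fano_check([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]))
    print(fano_check([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]))
```

For the second, fully symmetric channel the two sides coincide, which reflects the equality case of the argument: conditioned on an error, the index is uniform over the remaining $M - 1$ values.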


15.5 Bibliographic details and background 


Information theory was introduced in the seminal work of Shannon (1948; 1949); see also 
Shannon and Weaver (1949). Kullback and Leibler (1951) introduced the Kullback—Leibler 
divergence, and established various connections to both large-deviation theory and testing 
problems. Early work by Lindley (1956) also established connections between information 
and statistical estimation. Kolmogorov was the first to connect information theory and metric 
entropy; in particular, see appendix II of the paper by Kolmogorov and Tikhomirov (1959). 
The book by Cover and Thomas (1991) is a standard introductory-level text on information 
theory. The proof of Fano’s inequality given here follows their book. 

The parametric problems discussed in Examples 15.4 and 15.5 were considered in Le 
Cam (1973), where he described the lower bounding approach now known as Le Cam’s 
method. In this same paper, Le Cam also shows how a variety of nonparametric problems 
can also be treated by this method, using results on metric entropy. The paper by Hasmin- 
skii (1978) used the weakened form of the Fano method, based on the upper bound (15.34) 
on the mutual information, to derive lower bounds on density estimation in the uniform 
metric; see also the book by Hasminskii and Ibragimov (1981), as well as their survey 
paper (Hasminskii and Ibragimov, 1990). Assouad (1983) developed a method for deriv- 
ing lower bounds based on placing functions at vertices of the binary hypercube. See also 
Birgé (1983; 1987; 2005) for further refinements on methods for deriving both lower and upper bounds. The chapter by Yu (1996) provides a comparison of Le Cam's and Fano's methods, as well as Assouad's method (Assouad, 1983). Examples 15.8, 15.11 and 15.15 follow parts of her development. Birgé and Massart (1995) prove the upper bound (15.28) on
the squared Hellinger distance; see theorem 1 in their paper for further details. In their paper, 
they study the more general problem of estimating functionals of the density and its first k 
derivatives under general smoothness conditions of order $\alpha$. The quadratic functional problem considered in Examples 15.8 and 15.11 corresponds to the special case with $k = 1$ and $\alpha = 2$. The refined upper bound on mutual information from Lemma 15.21 is due to Yang
and Barron (1999). Their work showed how Fano’s method can be applied directly with 
global metric entropies, as opposed to constructing specific local packings of the function 
class, as in the local packing version of Fano’s method discussed in Section 15.3.3. 

Guntuboyina (2011) proves a generalization of Fano’s inequality to an arbitrary f-diver- 
gence. See Exercise 15.12 for further background on f-divergences and their properties. His 
result reduces to the classical Fano’s inequality when the underlying f-divergence is the 
Kullback—Leibler divergence. He illustrates how such generalized Fano bounds can be used 
to derive minimax bounds for various classes of problems, including covariance estimation. 
Lower bounds on variable selection in sparse linear regression using the Fano method,
as considered in Example 15.18, were derived by Wainwright (2009a). See also the pa- 
pers (Reeves and Gastpar, 2008; Fletcher et al., 2009; Akcakaya and Tarokh, 2010; Wang 
et al., 2010) for further results of this type. The lower bound on variable selection in sparse 
PCA from Example 15.20 was derived in Amini and Wainwright (2009); the proof given 
here is somewhat more streamlined due to the symmetrization with Rademacher variables. 

The notion of minimax risk discussed in this chapter is the classical one, in which no additional constraints (apart from measurability) are imposed on the estimators. Consequently, the theory allows for estimators that may involve prohibitive computational, storage or communication costs to implement. A more recent line of work has been studying constrained forms of statistical minimax theory, in which the infimum over estimators is suitably restricted (Wainwright, 2014). In certain cases, there can be substantial gaps between the classical minimax risk and its computationally constrained analogs (e.g., Berthet and Rigollet, 2013; Ma and Wu, 2013; Wang et al., 2014; Zhang et al., 2014; Cai et al., 2015; Gao et al., 2015). Similarly, privacy constraints can lead to substantial differences between the classical and private minimax risks (Duchi et al., 2014, 2013).


15.6 Exercises 


Exercise 15.1 (Alternative representation of TV norm) Show that the total variation norm has the equivalent variational representation

$$ \|\mathbb{P}_1 - \mathbb{P}_0\|_{\mathrm{TV}} = 1 - \inf_{f_0 + f_1 \ge 1} \big\{ \mathbb{E}_0[f_0] + \mathbb{E}_1[f_1] \big\}, $$

where the infimum runs over all non-negative measurable functions, and the inequality is taken pointwise.


Exercise 15.2 (Basics of discrete entropy) Let $\mathbb{Q}$ be the distribution of a discrete random variable on a finite set $\mathcal{X}$. Letting $q$ denote the associated probability mass function, its Shannon entropy has the explicit formula

$$ H(\mathbb{Q}) \equiv H(X) = -\sum_{x \in \mathcal{X}} q(x) \log q(x), $$

where we interpret $0 \log 0 = 0$.

(a) Show that $H(X) \ge 0$.
(b) Show that $H(X) \le \log |\mathcal{X}|$, with equality achieved when $X$ has the uniform distribution over $\mathcal{X}$.


Exercise 15.3 (Properties of Kullback–Leibler divergence) In this exercise, we study some properties of the Kullback–Leibler divergence. Let $\mathbb{P}$ and $\mathbb{Q}$ be two distributions having densities $p$ and $q$ with respect to a common base measure.

(a) Show that $D(\mathbb{P} \,\|\, \mathbb{Q}) \ge 0$ with equality if and only if the equality $p(x) = q(x)$ holds $\mathbb{P}$-almost everywhere.
(b) Given a collection of non-negative weights such that $\sum_{j=1}^m \lambda_j = 1$, show that

$$ D\Big( \sum_{j=1}^m \lambda_j \mathbb{P}_j \,\Big\|\, \mathbb{Q} \Big) \le \sum_{j=1}^m \lambda_j D(\mathbb{P}_j \,\|\, \mathbb{Q}) \tag{15.63a} $$

and

$$ D\Big( \mathbb{Q} \,\Big\|\, \sum_{j=1}^m \lambda_j \mathbb{P}_j \Big) \le \sum_{j=1}^m \lambda_j D(\mathbb{Q} \,\|\, \mathbb{P}_j). \tag{15.63b} $$

(c) Prove that the KL divergence satisfies the decoupling property (15.11a) for product measures.


Exercise 15.4 (More properties of Shannon entropy) Let $(X, Y, Z)$ denote a triplet of random variables, and recall the definition (15.59) of the conditional entropy.

(a) Prove that conditioning reduces entropy, that is, $H(X \mid Y) \le H(X)$.
(b) Prove the chain rule for entropy:

$$ H(X, Y, Z) = H(X) + H(Y \mid X) + H(Z \mid Y, X). $$

(c) Conclude from the previous parts that

$$ H(X, Y, Z) \le H(X) + H(Y) + H(Z), $$

so that joint entropy is maximized by independent variables.


Exercise 15.5 (Le Cam’s inequality) Prove the upper bound (15.10) on the total variation 
norm in terms of the Hellinger distance. (Hint: The Cauchy—Schwarz inequality could be 
useful.) 


Exercise 15.6 (Pinsker–Csiszár–Kullback inequality) In this exercise, we work through a proof of the Pinsker–Csiszár–Kullback inequality (15.8) from Lemma 15.2.

(a) When $\mathbb{P}$ and $\mathbb{Q}$ are Bernoulli distributions with parameters $\delta_p \in [0, 1]$ and $\delta_q \in [0, 1]$, show that inequality (15.8) reduces to

$$ 2 (\delta_p - \delta_q)^2 \le \delta_p \log \frac{\delta_p}{\delta_q} + (1 - \delta_p) \log \frac{1 - \delta_p}{1 - \delta_q}. \tag{15.64} $$

Prove the inequality in this special case.
(b) Use part (a) and Jensen's inequality to prove the bound in the general case. (Hint: Letting $p$ and $q$ denote densities, consider the set $A := \{x \in \mathcal{X} \mid p(x) \ge q(x)\}$, and try to reduce the problem to a version of part (a) with $\delta_p = \mathbb{P}[A]$ and $\delta_q = \mathbb{Q}[A]$.)


Exercise 15.7 (Decoupling for Hellinger distance) Show that the Hellinger distance satis- 
fies the decoupling relation (15.12a) for product measures. 


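Exercise 15.8 (Sharper bounds for Gaussian location family) Recall the normal location model from Example 15.4. Use the two-point form of Le Cam's method and the Pinsker–Csiszár–Kullback inequality from Lemma 15.2 to derive the sharper lower bounds

$$ \inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\big[ |\hat{\theta} - \theta| \big] \ge \frac{1}{8} \frac{\sigma}{\sqrt{n}} \qquad \text{and} \qquad \inf_{\hat{\theta}} \sup_{\theta \in \mathbb{R}} \mathbb{E}_\theta\big[ (\hat{\theta} - \theta)^2 \big] \ge \frac{1}{16} \frac{\sigma^2}{n}. $$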


Exercise 15.9 (Achievable rates for uniform shift family) In the context of the uniform shift family (Example 15.5), show that the estimator $\tilde{\theta} = \min\{Y_1, \ldots, Y_n\}$ satisfies the bound $\sup_{\theta \in \mathbb{R}} \mathbb{E}\big[ (\tilde{\theta} - \theta)^2 \big] \le \frac{2}{n^2}$.
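Exercise 15.10 (Bounds on the TV distance)

(a) Prove that the squared total variation distance is upper bounded as

$$ \|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}}^2 \le \frac{1}{4} \Big\{ \int_{\mathcal{X}} \frac{p^2(x)}{q(x)} \, \nu(dx) - 1 \Big\}, $$

where $p$ and $q$ are densities with respect to the base measure $\nu$.
(b) Use part (a) to show that

$$ \|\mathbb{P}^n_{\theta, \sigma} - \mathbb{P}^n_{0, \sigma}\|_{\mathrm{TV}}^2 \le \frac{1}{4} \Big\{ e^{\left( \frac{\sqrt{n} \theta}{\sigma} \right)^2} - 1 \Big\}, \tag{15.65} $$

where, for any $\gamma \in \mathbb{R}$, we use $\mathbb{P}^n_{\gamma, \sigma}$ to denote the $n$-fold product distribution of a $N(\gamma, \sigma^2)$ variate.
(c) Use part (a) to show that

$$ \|\bar{\mathbb{P}} - \mathbb{P}^n_{0, \sigma}\|_{\mathrm{TV}}^2 \le \frac{1}{4} \Big\{ e^{\frac{1}{2} \left( \frac{\sqrt{n} \theta}{\sigma} \right)^4} - 1 \Big\}, \tag{15.66} $$

where $\bar{\mathbb{P}} = \frac{1}{2} \mathbb{P}^n_{\theta, \sigma} + \frac{1}{2} \mathbb{P}^n_{-\theta, \sigma}$ is a mixture distribution.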


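Exercise 15.11 (Mixture distributions and KL divergence) Given a collection of distributions $\{\mathbb{P}_1, \ldots, \mathbb{P}_M\}$, consider the mixture distribution $\bar{\mathbb{Q}} = \frac{1}{M} \sum_{j=1}^M \mathbb{P}_j$. Show that

$$ \frac{1}{M} \sum_{j=1}^M D(\mathbb{P}_j \,\|\, \bar{\mathbb{Q}}) \le \frac{1}{M} \sum_{j=1}^M D(\mathbb{P}_j \,\|\, \mathbb{Q}) $$

for any other distribution $\mathbb{Q}$.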


Exercise 15.12 (f-divergences) Let $f: \mathbb{R}_+ \to \mathbb{R}$ be a strictly convex function. Given two distributions $\mathbb{P}$ and $\mathbb{Q}$ (with densities $p$ and $q$, respectively), their $f$-divergence is given by

$$ D_f(\mathbb{P} \,\|\, \mathbb{Q}) := \int q(x) \, f\Big( \frac{p(x)}{q(x)} \Big) \, dx. \tag{15.67} $$

(a) Show that the Kullback–Leibler divergence corresponds to the $f$-divergence defined by $f(t) = t \log t$.
(b) Compute the $f$-divergence generated by $f(t) = -\log(t)$.
(c) Show that the squared Hellinger divergence $H^2(\mathbb{P} \,\|\, \mathbb{Q})$ is also an $f$-divergence for an appropriate choice of $f$.
(d) Compute the $f$-divergence generated by the function $f(t) = 1 - \sqrt{t}$.


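Exercise 15.13 (KL divergence for multivariate Gaussian) For j = 1, 2, let $Q_j$ be a
d-variate normal distribution with mean vector $\mu_j \in \mathbb{R}^d$ and covariance matrix $\Sigma_j \succ 0$.


(a) If $\Sigma_1 = \Sigma_2 = \Sigma$, show that
\[
D(Q_1 \,\|\, Q_2) = \frac{1}{2} \big\langle \mu_1 - \mu_2, \; \Sigma^{-1} (\mu_1 - \mu_2) \big\rangle.
\]


(b) In the general setting, show that


\[
D(Q_1 \,\|\, Q_2) = \frac{1}{2} \Big\{ \big\langle \mu_1 - \mu_2, \; \Sigma_2^{-1} (\mu_1 - \mu_2) \big\rangle + \log \frac{\det(\Sigma_2)}{\det(\Sigma_1)} + \operatorname{trace}\big(\Sigma_2^{-1} \Sigma_1\big) - d \Big\}.
\]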
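
In dimension d = 1 the general formula reduces to $\frac{1}{2}\{(\mu_1 - \mu_2)^2/\sigma_2^2 + \log(\sigma_2^2/\sigma_1^2) + \sigma_1^2/\sigma_2^2 - 1\}$, which can be compared against direct numerical integration. The sketch below does this with a midpoint rule; the integration window and step count are arbitrary choices made for illustration.

```python
import math

def kl_gauss_1d(mu1, var1, mu2, var2):
    # Closed form for D(N(mu1, var1) || N(mu2, var2)) in one dimension.
    return 0.5 * ((mu1 - mu2) ** 2 / var2 + math.log(var2 / var1) + var1 / var2 - 1.0)

def kl_numeric(mu1, var1, mu2, var2, steps=20000, width=12.0):
    # Midpoint-rule approximation of the integral of q1 * log(q1 / q2),
    # taken over mu1 +/- width standard deviations of the first density.
    s1 = math.sqrt(var1)
    lo = mu1 - width * s1
    h = 2.0 * width * s1 / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        log_q1 = -0.5 * math.log(2 * math.pi * var1) - (x - mu1) ** 2 / (2 * var1)
        log_q2 = -0.5 * math.log(2 * math.pi * var2) - (x - mu2) ** 2 / (2 * var2)
        total += math.exp(log_q1) * (log_q1 - log_q2) * h
    return total

if __name__ == "__main__":
    print(kl_gauss_1d(0.3, 1.5, -0.4, 0.8), kl_numeric(0.3, 1.5, -0.4, 0.8))
```

Working with log-densities avoids underflow in the tails, where the ratio of the two densities would otherwise be numerically unstable.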
Exercise 15.14 (Gaussian distributions and maximum entropy) For a given $\sigma > 0$, let $\mathcal{Q}_\sigma$
be the class of all densities q with respect to Lebesgue measure on the real line such that
$\int_{-\infty}^{\infty} x \, q(x) \, dx = 0$ and $\int_{-\infty}^{\infty} q(x) \, x^2 \, dx \le \sigma^2$. Show that the maximum entropy distribution over
this family is the Gaussian $N(0, \sigma^2)$.
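
To get a feel for the claim, one can compare the differential entropy of the Gaussian with that of two other zero-mean densities having the same variance. The sketch below uses the standard closed-form entropies of the uniform and Laplace distributions; these two comparison families are chosen purely for illustration and are not mentioned in the exercise.

```python
import math

def entropy_gaussian(sigma):
    # Differential entropy of N(0, sigma^2): (1/2) * log(2 * pi * e * sigma^2).
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def entropy_uniform(sigma):
    # Uniform on [-a, a] with a = sqrt(3) * sigma has variance sigma^2
    # and differential entropy log(2 * a).
    return math.log(2 * math.sqrt(3) * sigma)

def entropy_laplace(sigma):
    # Laplace with scale b = sigma / sqrt(2) has variance 2 * b^2 = sigma^2
    # and differential entropy 1 + log(2 * b).
    b = sigma / math.sqrt(2)
    return 1.0 + math.log(2 * b)

if __name__ == "__main__":
    for sigma in (0.5, 1.0, 3.0):
        print(sigma, entropy_gaussian(sigma), entropy_laplace(sigma), entropy_uniform(sigma))
```

In each case the Gaussian entropy should be the largest of the three, consistent with (though of course not a proof of) the maximum entropy property.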


Exercise 15.15 (Sharper bound for variable selection in sparse PCA) In the context of
Example 15.20, show that for a given $\theta_{\min} = \min_{j \in S} |\theta^*_j| \in (0, 1)$, support recovery in sparse
PCA is not possible whenever


\[
n < c_0 \, \frac{1 + \nu}{\nu^2} \, \frac{\log(d - s + 1)}{\theta_{\min}^2}
\]


for some constant $c_0 > 0$. (Note: This result sharpens the bound from Example 15.20, since
we must have $\theta_{\min}^2 \le \frac{1}{s}$ due to the unit norm and s-sparsity of the eigenvector.)


Exercise 15.16 (Lower bounds for sparse PCA in $\ell_2$-error) Consider the problem of
estimating the maximal eigenvector $\theta^*$ based on n i.i.d. samples from the spiked covariance
model (15.47). Assuming that $\theta^*$ is s-sparse, show that any estimator $\hat\theta$ satisfies the lower
bound


\[
\sup_{\theta^* \in \mathbb{B}_0(s) \cap \mathbb{S}^{d-1}} \mathbb{E}\big[\|\hat\theta - \theta^*\|_2^2\big] \ge c_0 \, \frac{\nu + 1}{\nu^2} \, \frac{s \log\big(\frac{ed}{s}\big)}{n}
\]


for some universal constant co > 0. (Hint: The packing set from Example 15.16 may be 
useful to you. Moreover, you might consider a construction similar to Example 15.19, but 
with the random orthonormal matrix U replaced by a random permutation matrix along with 
random sign flips.) 


Exercise 15.17 (Lower bounds for generalized linear models) Consider the problem of
estimating a vector $\theta^* \in \mathbb{R}^d$ with Euclidean norm at most one, based on regression with a
fixed set of design vectors $\{x_i\}_{i=1}^n$, and responses $\{y_i\}_{i=1}^n$ drawn from the distribution


\[
P_\theta(y_1, \ldots, y_n) = \prod_{i=1}^n \Big[ h(y_i) \exp\Big( \frac{y_i \langle x_i, \theta \rangle - \Phi(\langle x_i, \theta \rangle)}{s(\sigma)} \Big) \Big],
\]


where $s(\sigma) > 0$ is a known scale factor, and $\Phi \colon \mathbb{R} \to \mathbb{R}$ is the cumulant function of the
generalized linear model.


(a) Compute an expression for the Kullback-Leibler divergence between $P_\theta$ and $P_{\theta'}$
involving $\Phi$ and its derivatives.

(b) Assuming that $\|\Phi''\|_\infty \le L < \infty$, give an upper bound on the Kullback-Leibler
divergence that scales quadratically in the Euclidean norm $\|\theta - \theta'\|_2$.

(c) Use part (b) and previous arguments to show that there is a universal constant c > 0 such
that


\[
\inf_{\hat\theta} \sup_{\theta \in \mathbb{B}_2(1)} \mathbb{E}\big[\|\hat\theta - \theta\|_2^2\big] \ge \min\Big\{ 1, \; c \, \frac{s(\sigma)}{L \, \eta_{\max}^2} \, \frac{d}{n} \Big\},
\]


where $\eta_{\max} = \sigma_{\max}(X/\sqrt{n})$ is the maximum singular value. (Here as usual $X \in \mathbb{R}^{n \times d}$ is
the design matrix with $x_i$ as its ith row.)
(d) Explain how part (c) yields our lower bound on linear regression as a special case. 
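
One concrete instance that may help when working through parts (a) and (b) is logistic regression with a single observation, where $\Phi(u) = \log(1 + e^u)$, $s(\sigma) = 1$ and $\|\Phi''\|_\infty \le 1/4$. This special case is an assumption chosen for illustration; it is not singled out in the exercise. The sketch checks numerically that the Bernoulli KL divergence between natural parameters u and v agrees with $\Phi(v) - \Phi(u) - \Phi'(u)(v - u)$ and is at most $(u - v)^2/8$.

```python
import math

def cumulant(u):
    # Logistic cumulant function Phi(u) = log(1 + exp(u)).
    return math.log1p(math.exp(u))

def sigmoid(u):
    # Phi'(u), the success probability of the Bernoulli response.
    return 1.0 / (1.0 + math.exp(-u))

def kl_bernoulli(u, v):
    # KL divergence between Bernoulli(sigmoid(u)) and Bernoulli(sigmoid(v)).
    p, r = sigmoid(u), sigmoid(v)
    return p * math.log(p / r) + (1 - p) * math.log((1 - p) / (1 - r))

def bregman(u, v):
    # Phi(v) - Phi(u) - Phi'(u) * (v - u).
    return cumulant(v) - cumulant(u) - sigmoid(u) * (v - u)

if __name__ == "__main__":
    for u, v in [(0.0, 1.0), (-1.5, 0.5), (2.0, -2.0)]:
        print(u, v, kl_bernoulli(u, v), bregman(u, v), (u - v) ** 2 / 8.0)
```

The first two printed quantities should coincide up to rounding, and both should lie below the quadratic bound in the last column.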
Exercise 15.18 (Lower bounds for additive nonparametric regression) Recall the class of
additive functions first introduced in Exercise 13.9, namely


\[
\mathcal{F}_{\mathrm{add}} = \Big\{ f \colon \mathbb{R}^d \to \mathbb{R} \;\Big|\; f = \sum_{j=1}^d g_j \text{ for } g_j \in \mathcal{G} \Big\},
\]


where $\mathcal{G}$ is some fixed class of univariate functions. In this exercise, we assume that the base
class has metric entropy scaling as $\log N(\delta; \mathcal{G}, \|\cdot\|_2) \asymp \big(\frac{1}{\delta}\big)^{1/\alpha}$ for some $\alpha > 1/2$, and that we
compute $L^2(\mathbb{P})$-norms using a product measure over $\mathbb{R}^d$.


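(a) Show that


\[
\inf_{\hat{f}} \sup_{f \in \mathcal{F}_{\mathrm{add}}} \mathbb{E}\big[\|\hat{f} - f\|_2^2\big] \gtrsim d \, \Big(\frac{\sigma^2}{n}\Big)^{\frac{2\alpha}{2\alpha + 1}}.
\]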


By comparison with the result of Exercise 14.8, we see that the least-squares estimator 
is minimax-optimal up to constant factors. 

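(b) Now consider the sparse variant of this model, namely based on the sparse additive
model (SPAM) class


\[
\mathcal{F}_{\mathrm{spam}} = \Big\{ f \colon \mathbb{R}^d \to \mathbb{R} \;\Big|\; f = \sum_{j \in S} g_j \text{ for } g_j \in \mathcal{G}, \text{ and a subset } |S| \le s \Big\}.
\]


Show that


\[
\inf_{\hat{f}} \sup_{f \in \mathcal{F}_{\mathrm{spam}}} \mathbb{E}\big[\|\hat{f} - f\|_2^2\big] \gtrsim s \, \Big(\frac{\sigma^2}{n}\Big)^{\frac{2\alpha}{2\alpha + 1}} + \sigma^2 \, \frac{s \log\big(\frac{ed}{s}\big)}{n}.
\]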